Home Credit Default Risk (HCDR) Project Phase 3¶

The course project is based on the Home Credit Default Risk (HCDR) Kaggle Competition. The goal of this project is to predict whether or not a client will repay a loan. In order to make sure that people who struggle to get loans due to insufficient or non-existent credit histories have a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

Some of the challenges¶

  1. Dataset size
    • (688 meg compressed) with millions of rows of data
    • 2.71 Gig of data uncompressed
  • Dealing with missing data
  • Imbalanced datasets
  • Summarizing transaction data

Dataset¶

Back ground Home Credit Group¶

Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

Home Credit Group¶

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

Background on the dataset¶

Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.

The company operates in 14 countries (including United States, Russia, Kazahstan, Belarus, China, India) and focuses on lending primarily to people with little or no credit history which will either not obtain loans or became victims of untrustworthly lenders.

Home Credit group has over 29 million customers, total assests of 21 billions Euro, over 160 millions loans, with the majority in Asia and and almost half of them in China (as of 19-05-2018).

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

Data files overview¶

The HomeCredit_columns_description.csv acts as a data dictioanry.

There are 7 different sources of data:

  • application_train/application_test (307k rows, and 48k rows): the main training and testing data with information about each loan application at Home Credit. Every loan has its own row and is identified by the feature SK_ID_CURR. The training application data comes with the TARGET indicating 0: the loan was repaid or 1: the loan was not repaid. The target variable defines if the client had payment difficulties meaning he/she had late payment more than X days on at least one of the first Y installments of the loan. Such case is marked as 1 while other all other cases as 0.
  • bureau (1.7 Million rows): data concerning client's previous credits from other financial institutions. Each previous credit has its own row in bureau, but one loan in the application data can have multiple previous credits.
  • bureau_balance (27 Million rows): monthly data about the previous credits in bureau. Each row is one month of a previous credit, and a single previous credit can have multiple rows, one for each month of the credit length.
  • previous_application (1.6 Million rows): previous applications for loans at Home Credit of clients who have loans in the application data. Each current loan in the application data can have multiple previous loans. Each previous application has one row and is identified by the feature SK_ID_PREV.
  • POS_CASH_BALANCE (10 Million rows): monthly data about previous point of sale or cash loans clients have had with Home Credit. Each row is one month of a previous point of sale or cash loan, and a single previous loan can have many rows.
  • credit_card_balance: monthly data about previous credit cards clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows.
  • installments_payment (13.6 Million rows): payment history for previous loans at Home Credit. There is one row for every made payment and one row for every missed payment.

Table sizes¶

name                       [  rows cols]     MegaBytes         
-----------------------  ------------------  -------
application_train       : [  307,511, 122]:   158MB
application_test        : [   48,744, 121]:   25MB
bureau                  : [ 1,716,428, 17]    162MB
bureau_balance          : [ 27,299,925, 3]:   358MB
credit_card_balance     : [  3,840,312, 23]   405MB
installments_payments   : [ 13,605,401, 8]    690MB
previous_application    : [  1,670,214, 37]   386MB
POS_CASH_balance        : [ 10,001,358, 8]    375MB

image.png

In [1]:
import numpy as np
import pandas as pd 
from sklearn.preprocessing import LabelEncoder
import os
import zipfile
from sklearn.base import BaseEstimator, TransformerMixin
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
import warnings
warnings.filterwarnings('ignore')

Data imports¶

In [2]:
import os
import pandas as pd

files = ["HCDR/" + f for f in os.listdir("HCDR")]
datasets = {}

for f in files:
    name = f.split("/")[1].split(".")[0]
    print(f"Loading {name}")
    datasets[name] = pd.read_csv(f, encoding='latin-1')

print()
print(datasets.keys())
Loading credit_card_balance
Loading installments_payments
Loading bureau_balance
Loading application_train
Loading POS_CASH_balance
Loading application_test
Loading bureau
Loading previous_application

dict_keys(['credit_card_balance', 'installments_payments', 'bureau_balance', 'application_train', 'POS_CASH_balance', 'application_test', 'bureau', 'previous_application'])

Data files overview¶

Data Dictionary¶

As part of the data download comes a Data Dictionary. It named HomeCredit_columns_description.csv

image.png

Data Dictionary provided by data download¶

Table Row Description Special
application_{train test}.csv SK_ID_CURR ID of loan in our sample
application_{train test}.csv TARGET Target variable (1 - client with payment difficulties: he/she had late payment more than X days on at least one of the first Y installments of the loan in our sample, 0 - all other cases)
application_{train test}.csv NAME_CONTRACT_TYPE Identification if loan is cash or revolving
application_{train test}.csv CODE_GENDER Gender of the client
application_{train test}.csv FLAG_OWN_CAR Flag if the client owns a car
application_{train test}.csv FLAG_OWN_REALTY Flag if client owns a house or flat
application_{train test}.csv CNT_CHILDREN Number of children the client has
application_{train test}.csv AMT_INCOME_TOTAL Income of the client
application_{train test}.csv AMT_CREDIT Credit amount of the loan
application_{train test}.csv AMT_ANNUITY Loan annuity
application_{train test}.csv AMT_GOODS_PRICE For consumer loans it is the price of the goods for which the loan is given
application_{train test}.csv NAME_TYPE_SUITE Who was accompanying client when he was applying for the loan
application_{train test}.csv NAME_INCOME_TYPE Clients income type (businessman, working, maternity leave,…)
application_{train test}.csv NAME_EDUCATION_TYPE Level of highest education the client achieved
application_{train test}.csv NAME_FAMILY_STATUS Family status of the client
application_{train test}.csv NAME_HOUSING_TYPE What is the housing situation of the client (renting, living with parents, ...)
application_{train test}.csv REGION_POPULATION_RELATIVE Normalized population of region where client lives (higher number means the client lives in more populated region) normalized
application_{train test}.csv DAYS_BIRTH Client's age in days at the time of application time only relative to the application
application_{train test}.csv DAYS_EMPLOYED How many days before the application the person started current employment time only relative to the application
application_{train test}.csv DAYS_REGISTRATION How many days before the application did client change his registration time only relative to the application
application_{train test}.csv DAYS_ID_PUBLISH How many days before the application did client change the identity document with which he applied for the loan time only relative to the application
application_{train test}.csv OWN_CAR_AGE Age of client's car
application_{train test}.csv FLAG_MOBIL Did client provide mobile phone (1=YES, 0=NO)
application_{train test}.csv FLAG_EMP_PHONE Did client provide work phone (1=YES, 0=NO)
application_{train test}.csv FLAG_WORK_PHONE Did client provide home phone (1=YES, 0=NO)
application_{train test}.csv FLAG_CONT_MOBILE Was mobile phone reachable (1=YES, 0=NO)
application_{train test}.csv FLAG_PHONE Did client provide home phone (1=YES, 0=NO)
application_{train test}.csv FLAG_EMAIL Did client provide email (1=YES, 0=NO)
application_{train test}.csv OCCUPATION_TYPE What kind of occupation does the client have
application_{train test}.csv CNT_FAM_MEMBERS How many family members does client have
application_{train test}.csv REGION_RATING_CLIENT Our rating of the region where client lives (1,2,3)
application_{train test}.csv REGION_RATING_CLIENT_W_CITY Our rating of the region where client lives with taking city into account (1,2,3)
application_{train test}.csv WEEKDAY_APPR_PROCESS_START On which day of the week did the client apply for the loan
application_{train test}.csv HOUR_APPR_PROCESS_START Approximately at what hour did the client apply for the loan rounded
application_{train test}.csv REG_REGION_NOT_LIVE_REGION Flag if client's permanent address does not match contact address (1=different, 0=same, at region level)
application_{train test}.csv REG_REGION_NOT_WORK_REGION Flag if client's permanent address does not match work address (1=different, 0=same, at region level)
application_{train test}.csv LIVE_REGION_NOT_WORK_REGION Flag if client's contact address does not match work address (1=different, 0=same, at region level)
application_{train test}.csv REG_CITY_NOT_LIVE_CITY Flag if client's permanent address does not match contact address (1=different, 0=same, at city level)
application_{train test}.csv REG_CITY_NOT_WORK_CITY Flag if client's permanent address does not match work address (1=different, 0=same, at city level)
application_{train test}.csv LIVE_CITY_NOT_WORK_CITY Flag if client's contact address does not match work address (1=different, 0=same, at city level)
application_{train test}.csv ORGANIZATION_TYPE Type of organization where client works
application_{train test}.csv EXT_SOURCE_1 Normalized score from external data source normalized
application_{train test}.csv EXT_SOURCE_2 Normalized score from external data source normalized
application_{train test}.csv EXT_SOURCE_3 Normalized score from external data source normalized
application_{train test}.csv APARTMENTS_AVG Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv BASEMENTAREA_AVG Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv YEARS_BEGINEXPLUATATION_AVG Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv YEARS_BUILD_AVG Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv COMMONAREA_AVG Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv ELEVATORS_AVG Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv ENTRANCES_AVG Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv FLOORSMAX_AVG Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv FLOORSMIN_AVG Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv LANDAREA_AVG Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv LIVINGAPARTMENTS_AVG Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv LIVINGAREA_AVG Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv NONLIVINGAPARTMENTS_AVG Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv NONLIVINGAREA_AVG Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv APARTMENTS_MODE Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv BASEMENTAREA_MODE Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv YEARS_BEGINEXPLUATATION_MODE Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv YEARS_BUILD_MODE Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv COMMONAREA_MODE Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv ELEVATORS_MODE Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv ENTRANCES_MODE Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv FLOORSMAX_MODE Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv FLOORSMIN_MODE Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv LANDAREA_MODE Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv LIVINGAPARTMENTS_MODE Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv LIVINGAREA_MODE Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv NONLIVINGAPARTMENTS_MODE Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv NONLIVINGAREA_MODE Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv APARTMENTS_MEDI Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv BASEMENTAREA_MEDI Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv YEARS_BEGINEXPLUATATION_MEDI Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv YEARS_BUILD_MEDI Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv COMMONAREA_MEDI Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv ELEVATORS_MEDI Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv ENTRANCES_MEDI Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv FLOORSMAX_MEDI Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv FLOORSMIN_MEDI Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv LANDAREA_MEDI Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv LIVINGAPARTMENTS_MEDI Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv LIVINGAREA_MEDI Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv NONLIVINGAPARTMENTS_MEDI Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv NONLIVINGAREA_MEDI Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv FONDKAPREMONT_MODE Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv HOUSETYPE_MODE Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv TOTALAREA_MODE Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv WALLSMATERIAL_MODE Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv EMERGENCYSTATE_MODE Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor normalized
application_{train test}.csv OBS_30_CNT_SOCIAL_CIRCLE How many observation of client's social surroundings with observable 30 DPD (days past due) default
application_{train test}.csv DEF_30_CNT_SOCIAL_CIRCLE How many observation of client's social surroundings defaulted on 30 DPD (days past due)
application_{train test}.csv OBS_60_CNT_SOCIAL_CIRCLE How many observation of client's social surroundings with observable 60 DPD (days past due) default
application_{train test}.csv DEF_60_CNT_SOCIAL_CIRCLE How many observation of client's social surroundings defaulted on 60 (days past due) DPD
application_{train test}.csv DAYS_LAST_PHONE_CHANGE How many days before application did client change phone
application_{train test}.csv FLAG_DOCUMENT_2 Did client provide document 2
application_{train test}.csv FLAG_DOCUMENT_3 Did client provide document 3
application_{train test}.csv FLAG_DOCUMENT_4 Did client provide document 4
application_{train test}.csv FLAG_DOCUMENT_5 Did client provide document 5
application_{train test}.csv FLAG_DOCUMENT_6 Did client provide document 6
application_{train test}.csv FLAG_DOCUMENT_7 Did client provide document 7
application_{train test}.csv FLAG_DOCUMENT_8 Did client provide document 8
application_{train test}.csv FLAG_DOCUMENT_9 Did client provide document 9
application_{train test}.csv FLAG_DOCUMENT_10 Did client provide document 10
application_{train test}.csv FLAG_DOCUMENT_11 Did client provide document 11
application_{train test}.csv FLAG_DOCUMENT_12 Did client provide document 12
application_{train test}.csv FLAG_DOCUMENT_13 Did client provide document 13
application_{train test}.csv FLAG_DOCUMENT_14 Did client provide document 14
application_{train test}.csv FLAG_DOCUMENT_15 Did client provide document 15
application_{train test}.csv FLAG_DOCUMENT_16 Did client provide document 16
application_{train test}.csv FLAG_DOCUMENT_17 Did client provide document 17
application_{train test}.csv FLAG_DOCUMENT_18 Did client provide document 18
application_{train test}.csv FLAG_DOCUMENT_19 Did client provide document 19
application_{train test}.csv FLAG_DOCUMENT_20 Did client provide document 20
application_{train test}.csv FLAG_DOCUMENT_21 Did client provide document 21
application_{train test}.csv AMT_REQ_CREDIT_BUREAU_HOUR Number of enquiries to Credit Bureau about the client one hour before application
application_{train test}.csv AMT_REQ_CREDIT_BUREAU_DAY Number of enquiries to Credit Bureau about the client one day before application (excluding one hour before application)
application_{train test}.csv AMT_REQ_CREDIT_BUREAU_WEEK Number of enquiries to Credit Bureau about the client one week before application (excluding one day before application)
application_{train test}.csv AMT_REQ_CREDIT_BUREAU_MON Number of enquiries to Credit Bureau about the client one month before application (excluding one week before application)
application_{train test}.csv AMT_REQ_CREDIT_BUREAU_QRT Number of enquiries to Credit Bureau about the client 3 month before application (excluding one month before application)
application_{train test}.csv AMT_REQ_CREDIT_BUREAU_YEAR Number of enquiries to Credit Bureau about the client one day year (excluding last 3 months before application)
bureau.csv SK_ID_CURR ID of loan in our sample - one loan in our sample can have 0,1,2 or more related previous credits in credit bureau hashed
bureau.csv SK_BUREAU_ID Recoded ID of previous Credit Bureau credit related to our loan (unique coding for each loan application) hashed
bureau.csv CREDIT_ACTIVE Status of the Credit Bureau (CB) reported credits
bureau.csv CREDIT_CURRENCY Recoded currency of the Credit Bureau credit recoded
bureau.csv DAYS_CREDIT How many days before current application did client apply for Credit Bureau credit time only relative to the application
bureau.csv CREDIT_DAY_OVERDUE Number of days past due on CB credit at the time of application for related loan in our sample
bureau.csv DAYS_CREDIT_ENDDATE Remaining duration of CB credit (in days) at the time of application in Home Credit time only relative to the application
bureau.csv DAYS_ENDDATE_FACT Days since CB credit ended at the time of application in Home Credit (only for closed credit) time only relative to the application
bureau.csv AMT_CREDIT_MAX_OVERDUE Maximal amount overdue on the Credit Bureau credit so far (at application date of loan in our sample)
bureau.csv CNT_CREDIT_PROLONG How many times was the Credit Bureau credit prolonged
bureau.csv AMT_CREDIT_SUM Current credit amount for the Credit Bureau credit
bureau.csv AMT_CREDIT_SUM_DEBT Current debt on Credit Bureau credit
bureau.csv AMT_CREDIT_SUM_LIMIT Current credit limit of credit card reported in Credit Bureau
bureau.csv AMT_CREDIT_SUM_OVERDUE Current amount overdue on Credit Bureau credit
bureau.csv CREDIT_TYPE Type of Credit Bureau credit (Car, cash,...)
bureau.csv DAYS_CREDIT_UPDATE How many days before loan application did last information about the Credit Bureau credit come time only relative to the application
bureau.csv AMT_ANNUITY Annuity of the Credit Bureau credit
bureau_balance.csv SK_BUREAU_ID Recoded ID of Credit Bureau credit (unique coding for each application) - use this to join to CREDIT_BUREAU table hashed
bureau_balance.csv MONTHS_BALANCE Month of balance relative to application date (-1 means the freshest balance date) time only relative to the application
bureau_balance.csv STATUS Status of Credit Bureau loan during the month (active, closed, DPD0-30,… [C means closed, X means status unknown, 0 means no DPD, 1 means maximal did during month between 1-30, 2 means DPD 31-60,… 5 means DPD 120+ or sold or written off ] )
POS_CASH_balance.csv SK_ID_PREV ID of previous credit in Home Credit related to loan in our sample. (One loan in our sample can have 0,1,2 or more previous loans in Home Credit)
POS_CASH_balance.csv SK_ID_CURR ID of loan in our sample
POS_CASH_balance.csv MONTHS_BALANCE Month of balance relative to application date (-1 means the information to the freshest monthly snapshot, 0 means the information at application - often it will be the same as -1 as many banks are not updating the information to Credit Bureau regularly ) time only relative to the application
POS_CASH_balance.csv CNT_INSTALMENT Term of previous credit (can change over time)
POS_CASH_balance.csv CNT_INSTALMENT_FUTURE Installments left to pay on the previous credit
POS_CASH_balance.csv NAME_CONTRACT_STATUS Contract status during the month
POS_CASH_balance.csv SK_DPD DPD (days past due) during the month of previous credit
POS_CASH_balance.csv SK_DPD_DEF DPD during the month with tolerance (debts with low loan amounts are ignored) of the previous credit
credit_card_balance.csv SK_ID_PREV ID of previous credit in Home credit related to loan in our sample. (One loan in our sample can have 0,1,2 or more previous loans in Home Credit) hashed
credit_card_balance.csv SK_ID_CURR ID of loan in our sample hashed
credit_card_balance.csv MONTHS_BALANCE Month of balance relative to application date (-1 means the freshest balance date) time only relative to the application
credit_card_balance.csv AMT_BALANCE Balance during the month of previous credit
credit_card_balance.csv AMT_CREDIT_LIMIT_ACTUAL Credit card limit during the month of the previous credit
credit_card_balance.csv AMT_DRAWINGS_ATM_CURRENT Amount drawing at ATM during the month of the previous credit
credit_card_balance.csv AMT_DRAWINGS_CURRENT Amount drawing during the month of the previous credit
credit_card_balance.csv AMT_DRAWINGS_OTHER_CURRENT Amount of other drawings during the month of the previous credit
credit_card_balance.csv AMT_DRAWINGS_POS_CURRENT Amount drawing or buying goods during the month of the previous credit
credit_card_balance.csv AMT_INST_MIN_REGULARITY Minimal installment for this month of the previous credit
credit_card_balance.csv AMT_PAYMENT_CURRENT How much did the client pay during the month on the previous credit
credit_card_balance.csv AMT_PAYMENT_TOTAL_CURRENT How much did the client pay during the month in total on the previous credit
credit_card_balance.csv AMT_RECEIVABLE_PRINCIPAL Amount receivable for principal on the previous credit
credit_card_balance.csv AMT_RECIVABLE Amount receivable on the previous credit
credit_card_balance.csv AMT_TOTAL_RECEIVABLE Total amount receivable on the previous credit
credit_card_balance.csv CNT_DRAWINGS_ATM_CURRENT Number of drawings at ATM during this month on the previous credit
credit_card_balance.csv CNT_DRAWINGS_CURRENT Number of drawings during this month on the previous credit
credit_card_balance.csv CNT_DRAWINGS_OTHER_CURRENT Number of other drawings during this month on the previous credit
credit_card_balance.csv CNT_DRAWINGS_POS_CURRENT Number of drawings for goods during this month on the previous credit
credit_card_balance.csv CNT_INSTALMENT_MATURE_CUM Number of paid installments on the previous credit
credit_card_balance.csv NAME_CONTRACT_STATUS Contract status (active signed,...) on the previous credit
credit_card_balance.csv SK_DPD DPD (Days past due) during the month on the previous credit
credit_card_balance.csv SK_DPD_DEF DPD (Days past due) during the month with tolerance (debts with low loan amounts are ignored) of the previous credit
previous_application.csv SK_ID_PREV ID of previous credit in Home credit related to loan in our sample. (One loan in our sample can have 0,1,2 or more previous loan applications in Home Credit, previous application could, but not necessarily have to lead to credit) hashed
previous_application.csv SK_ID_CURR ID of loan in our sample hashed
previous_application.csv NAME_CONTRACT_TYPE Contract product type (Cash loan, consumer loan [POS] ,...) of the previous application
previous_application.csv AMT_ANNUITY Annuity of previous application
previous_application.csv AMT_APPLICATION For how much credit did client ask on the previous application
previous_application.csv AMT_CREDIT Final credit amount on the previous application. This differs from AMT_APPLICATION in a way that the AMT_APPLICATION is the amount for which the client initially applied for, but during our approval process he could have received different amount - AMT_CREDIT
previous_application.csv AMT_DOWN_PAYMENT Down payment on the previous application
previous_application.csv AMT_GOODS_PRICE Goods price of good that client asked for (if applicable) on the previous application
previous_application.csv WEEKDAY_APPR_PROCESS_START On which day of the week did the client apply for previous application
previous_application.csv HOUR_APPR_PROCESS_START Approximately at what day hour did the client apply for the previous application rounded
previous_application.csv FLAG_LAST_APPL_PER_CONTRACT Flag if it was last application for the previous contract. Sometimes by mistake of client or our clerk there could be more applications for one single contract
previous_application.csv NFLAG_LAST_APPL_IN_DAY Flag if the application was the last application per day of the client. Sometimes clients apply for more applications a day. Rarely it could also be error in our system that one application is in the database twice
previous_application.csv NFLAG_MICRO_CASH Flag Micro finance loan
previous_application.csv RATE_DOWN_PAYMENT Down payment rate normalized on previous credit normalized
previous_application.csv RATE_INTEREST_PRIMARY Interest rate normalized on previous credit normalized
previous_application.csv RATE_INTEREST_PRIVILEGED Interest rate normalized on previous credit normalized
previous_application.csv NAME_CASH_LOAN_PURPOSE Purpose of the cash loan
previous_application.csv NAME_CONTRACT_STATUS Contract status (approved, cancelled, ...) of previous application
previous_application.csv DAYS_DECISION Relative to current application when was the decision about previous application made time only relative to the application
previous_application.csv NAME_PAYMENT_TYPE Payment method that client chose to pay for the previous application
previous_application.csv CODE_REJECT_REASON Why was the previous application rejected
previous_application.csv NAME_TYPE_SUITE Who accompanied client when applying for the previous application
previous_application.csv NAME_CLIENT_TYPE Was the client old or new client when applying for the previous application
previous_application.csv NAME_GOODS_CATEGORY What kind of goods did the client apply for in the previous application
previous_application.csv NAME_PORTFOLIO Was the previous application for CASH, POS, CAR, …
previous_application.csv NAME_PRODUCT_TYPE Was the previous application x-sell o walk-in
previous_application.csv CHANNEL_TYPE Through which channel we acquired the client on the previous application
previous_application.csv SELLERPLACE_AREA Selling area of seller place of the previous application
previous_application.csv NAME_SELLER_INDUSTRY The industry of the seller
previous_application.csv CNT_PAYMENT Term of previous credit at application of the previous application
previous_application.csv NAME_YIELD_GROUP Grouped interest rate into small medium and high of the previous application grouped
previous_application.csv PRODUCT_COMBINATION Detailed product combination of the previous application
previous_application.csv DAYS_FIRST_DRAWING Relative to application date of current application when was the first disbursement of the previous application time only relative to the application
previous_application.csv DAYS_FIRST_DUE Relative to application date of current application when was the first due supposed to be of the previous application time only relative to the application
previous_application.csv DAYS_LAST_DUE_1ST_VERSION Relative to application date of current application when was the first due of the previous application time only relative to the application
previous_application.csv DAYS_LAST_DUE Relative to application date of current application when was the last due date of the previous application time only relative to the application
previous_application.csv DAYS_TERMINATION Relative to application date of current application when was the expected termination of the previous application time only relative to the application
previous_application.csv NFLAG_INSURED_ON_APPROVAL Did the client requested insurance during the previous application
installments_payments.csv SK_ID_PREV ID of previous credit in Home credit related to loan in our sample. (One loan in our sample can have 0,1,2 or more previous loans in Home Credit) hashed
installments_payments.csv SK_ID_CURR ID of loan in our sample hashed
installments_payments.csv NUM_INSTALMENT_VERSION Version of installment calendar (0 is for credit card) of previous credit. Change of installment version from month to month signifies that some parameter of payment calendar has changed
installments_payments.csv NUM_INSTALMENT_NUMBER On which installment we observe payment
installments_payments.csv DAYS_INSTALMENT When the installment of previous credit was supposed to be paid (relative to application date of current loan) time only relative to the application
installments_payments.csv DAYS_ENTRY_PAYMENT When was the installments of previous credit paid actually (relative to application date of current loan) time only relative to the application
installments_payments.csv AMT_INSTALMENT What was the prescribed installment amount of previous credit on this installment
installments_payments.csv AMT_PAYMENT What the client actually paid on previous credit on this installment

Join the unlabeled dataset (i.e., the submission file)¶

In [3]:
appsDF = datasets["previous_application"]
display(appsDF.head())
print(f"{appsDF.shape[0]:,} rows, {appsDF.shape[1]:,} columns")
SK_ID_PREV SK_ID_CURR NAME_CONTRACT_TYPE AMT_ANNUITY AMT_APPLICATION AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START ... NAME_SELLER_INDUSTRY CNT_PAYMENT NAME_YIELD_GROUP PRODUCT_COMBINATION DAYS_FIRST_DRAWING DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION NFLAG_INSURED_ON_APPROVAL
0 2030495 271877 Consumer loans 1730.430 17145.0 17145.0 0.0 17145.0 SATURDAY 15 ... Connectivity 12.0 middle POS mobile with interest 365243.0 -42.0 300.0 -42.0 -37.0 0.0
1 2802425 108129 Cash loans 25188.615 607500.0 679671.0 NaN 607500.0 THURSDAY 11 ... XNA 36.0 low_action Cash X-Sell: low 365243.0 -134.0 916.0 365243.0 365243.0 1.0
2 2523466 122040 Cash loans 15060.735 112500.0 136444.5 NaN 112500.0 TUESDAY 11 ... XNA 12.0 high Cash X-Sell: high 365243.0 -271.0 59.0 365243.0 365243.0 1.0
3 2819243 176158 Cash loans 47041.335 450000.0 470790.0 NaN 450000.0 MONDAY 7 ... XNA 12.0 middle Cash X-Sell: middle 365243.0 -482.0 -152.0 -182.0 -177.0 1.0
4 1784265 202054 Cash loans 31924.395 337500.0 404055.0 NaN 337500.0 THURSDAY 9 ... XNA 24.0 high Cash Street: high NaN NaN NaN NaN NaN NaN

5 rows × 37 columns

1,670,214 rows, 37 columns
In [4]:
# Create aggregate features (via pipeline)
class prevAppsFeaturesAggregater(BaseEstimator, TransformerMixin):
    def __init__(self, features=None): # no *args or **kargs
        self.features = features
        self.agg_op_features = {}
        for f in features:
#             self.agg_op_features[f] = {f"{f}_{func}":func for func in ["min", "max", "mean"]}
            self.agg_op_features[f] =  ["min", "max", "mean"]

    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        #from IPython.core.debugger import Pdb as pdb;    pdb().set_trace() #breakpoint; dont forget to quit         
        result = X.groupby(["SK_ID_CURR"]).agg(self.agg_op_features)
#         result.columns = result.columns.droplevel()
        result.columns = ["_".join(x) for x in result.columns.ravel()]

        result = result.reset_index(level=["SK_ID_CURR"])
        result['range_AMT_APPLICATION'] = result['AMT_APPLICATION_max'] - result['AMT_APPLICATION_min']
        return result # return dataframe with the join key "SK_ID_CURR"
    

from sklearn.pipeline import make_pipeline 
def test_driver_prevAppsFeaturesAggregater(df, features):
    print(f"df.shape: {df.shape}\n")
    print(f"df[{features}][0:5]: \n{df[features][0:5]}")
    test_pipeline = make_pipeline(prevAppsFeaturesAggregater(features))
    return(test_pipeline.fit_transform(df))
         
features = ['AMT_ANNUITY', 'AMT_APPLICATION']
features = ['AMT_ANNUITY',
       'AMT_APPLICATION', 'AMT_CREDIT', 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE',
       'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
       'RATE_INTEREST_PRIVILEGED', 'DAYS_DECISION', 'NAME_PAYMENT_TYPE',
       'CNT_PAYMENT', 
       'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION',
       'DAYS_LAST_DUE', 'DAYS_TERMINATION']
features = ['AMT_ANNUITY', 'AMT_APPLICATION']
res = test_driver_prevAppsFeaturesAggregater(appsDF, features)
print(f"HELLO")
print(f"Test driver: \n{res[0:10]}")
print(f"input[features][0:10]: \n{appsDF[0:10]}")
df.shape: (1670214, 37)

df[['AMT_ANNUITY', 'AMT_APPLICATION']][0:5]: 
   AMT_ANNUITY  AMT_APPLICATION
0     1730.430          17145.0
1    25188.615         607500.0
2    15060.735         112500.0
3    47041.335         450000.0
4    31924.395         337500.0
HELLO
Test driver: 
   SK_ID_CURR  AMT_ANNUITY_min  AMT_ANNUITY_max  AMT_ANNUITY_mean  \
0      100001         3951.000         3951.000       3951.000000   
1      100002         9251.775         9251.775       9251.775000   
2      100003         6737.310        98356.995      56553.990000   
3      100004         5357.250         5357.250       5357.250000   
4      100005         4813.200         4813.200       4813.200000   
5      100006         2482.920        39954.510      23651.175000   
6      100007         1834.290        22678.785      12278.805000   
7      100008         8019.090        25309.575      15839.696250   
8      100009         7435.845        17341.605      10051.412143   
9      100010        27463.410        27463.410      27463.410000   

   AMT_APPLICATION_min  AMT_APPLICATION_max  AMT_APPLICATION_mean  \
0              24835.5              24835.5          24835.500000   
1             179055.0             179055.0         179055.000000   
2              68809.5             900000.0         435436.500000   
3              24282.0              24282.0          24282.000000   
4                  0.0              44617.5          22308.750000   
5                  0.0             688500.0         272203.260000   
6              17176.5             247500.0         150530.250000   
7                  0.0             450000.0         155701.800000   
8              40455.0             110160.0          76741.714286   
9             247212.0             247212.0         247212.000000   

   range_AMT_APPLICATION  
0                    0.0  
1                    0.0  
2               831190.5  
3                    0.0  
4                44617.5  
5               688500.0  
6               230323.5  
7               450000.0  
8                69705.0  
9                    0.0  
input[features][0:10]: 
   SK_ID_PREV  SK_ID_CURR NAME_CONTRACT_TYPE  AMT_ANNUITY  AMT_APPLICATION  \
0     2030495      271877     Consumer loans     1730.430          17145.0   
1     2802425      108129         Cash loans    25188.615         607500.0   
2     2523466      122040         Cash loans    15060.735         112500.0   
3     2819243      176158         Cash loans    47041.335         450000.0   
4     1784265      202054         Cash loans    31924.395         337500.0   
5     1383531      199383         Cash loans    23703.930         315000.0   
6     2315218      175704         Cash loans          NaN              0.0   
7     1656711      296299         Cash loans          NaN              0.0   
8     2367563      342292         Cash loans          NaN              0.0   
9     2579447      334349         Cash loans          NaN              0.0   

   AMT_CREDIT  AMT_DOWN_PAYMENT  AMT_GOODS_PRICE WEEKDAY_APPR_PROCESS_START  \
0     17145.0               0.0          17145.0                   SATURDAY   
1    679671.0               NaN         607500.0                   THURSDAY   
2    136444.5               NaN         112500.0                    TUESDAY   
3    470790.0               NaN         450000.0                     MONDAY   
4    404055.0               NaN         337500.0                   THURSDAY   
5    340573.5               NaN         315000.0                   SATURDAY   
6         0.0               NaN              NaN                    TUESDAY   
7         0.0               NaN              NaN                     MONDAY   
8         0.0               NaN              NaN                     MONDAY   
9         0.0               NaN              NaN                   SATURDAY   

   HOUR_APPR_PROCESS_START  ... NAME_SELLER_INDUSTRY  CNT_PAYMENT  \
0                       15  ...         Connectivity         12.0   
1                       11  ...                  XNA         36.0   
2                       11  ...                  XNA         12.0   
3                        7  ...                  XNA         12.0   
4                        9  ...                  XNA         24.0   
5                        8  ...                  XNA         18.0   
6                       11  ...                  XNA          NaN   
7                        7  ...                  XNA          NaN   
8                       15  ...                  XNA          NaN   
9                       15  ...                  XNA          NaN   

   NAME_YIELD_GROUP       PRODUCT_COMBINATION  DAYS_FIRST_DRAWING  \
0            middle  POS mobile with interest            365243.0   
1        low_action          Cash X-Sell: low            365243.0   
2              high         Cash X-Sell: high            365243.0   
3            middle       Cash X-Sell: middle            365243.0   
4              high         Cash Street: high                 NaN   
5        low_normal          Cash X-Sell: low            365243.0   
6               XNA                      Cash                 NaN   
7               XNA                      Cash                 NaN   
8               XNA                      Cash                 NaN   
9               XNA                      Cash                 NaN   

  DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION  DAYS_LAST_DUE DAYS_TERMINATION  \
0          -42.0                     300.0          -42.0            -37.0   
1         -134.0                     916.0       365243.0         365243.0   
2         -271.0                      59.0       365243.0         365243.0   
3         -482.0                    -152.0         -182.0           -177.0   
4            NaN                       NaN            NaN              NaN   
5         -654.0                    -144.0         -144.0           -137.0   
6            NaN                       NaN            NaN              NaN   
7            NaN                       NaN            NaN              NaN   
8            NaN                       NaN            NaN              NaN   
9            NaN                       NaN            NaN              NaN   

  NFLAG_INSURED_ON_APPROVAL  
0                       0.0  
1                       1.0  
2                       1.0  
3                       1.0  
4                       NaN  
5                       1.0  
6                       NaN  
7                       NaN  
8                       NaN  
9                       NaN  

[10 rows x 37 columns]
In [5]:
features = ['AMT_ANNUITY', 'AMT_APPLICATION']
prevApps_feature_pipeline = Pipeline([
#         ('prevApps_add_features1', prevApps_add_features1()),  # add some new features 
#         ('prevApps_add_features2', prevApps_add_features2()),  # add some new features
#         ('prevApps_aggregater', prevAppsFeaturesAggregater()), # Aggregate across old and new features
            ('prevApps_aggregater', prevAppsFeaturesAggregater(features)), # Aggregate across old and new features

    ])


X_train= datasets["application_train"] #primary dataset
appsDF = datasets["previous_application"] #prev app


merge_all_data = False
# transform all the secondary tables
# 'bureau', 'bureau_balance', 'credit_card_balance', 'installments_payments', 
# 'previous_application', 'POS_CASH_balance'
In [6]:
if merge_all_data:
    prevApps_aggregated = prevApps_feature_pipeline.transform(appsDF)
    
    #'bureau', 'bureau_balance', 'credit_card_balance', 'installments_payments', 
    # 'previous_application', 'POS_CASH_balance'

    bureau_aggregated = bureau_feature_pipeline.transform(bureau_DF)
    bureau_bal_aggregated = bureau_bal_feature_pipeline.transform(bureau_bal)
    cc_bal_aggregated = cc_bal_feature_pipeline.transform(cc_bal_DF)
    install_pmt_aggregated = install_pmt_feature_pipeline.transform(install_pmt_DF)
    POS_cash_bal_aggregated = POS_cash_bal_feature_pipeline.transform(POS_cash_DF)
In [7]:
X_kaggle_test= datasets["application_test"]
if merge_all_data:
    # 1. Join/Merge in prevApps Data
    X_kaggle_test = X_kaggle_test.merge(prevApps_aggregated, how='left', on='SK_ID_CURR')

    # 2. Join/Merge in bureau Data
    X_kaggle_test = X_kaggle_test.merge(bureau_aggregated, how='left', on='SK_ID_CURR')

    # 3. Join/Merge in bureau_balance Data
    X_kaggle_test = X_kaggle_test.merge(bureau_bal_aggregated, how='left', on='SK_ID_CURR')


    # 4. Join/Merge in credit_card_balance Data
    X_kaggle_test = X_kaggle_test.merge(cc_bal_aggregated, how='left', on='SK_ID_CURR')


    # 5. Join/Merge in installments_payments Data
    X_kaggle_test = X_kaggle_test.merge(install_pmt_aggregated, how='left', on='SK_ID_CURR')
    
    # 6. Join/Merge in POS_cash_balance Data
    X_kaggle_test = X_kaggle_test.merge(POS_cash_bal_aggregated, how='left', on='SK_ID_CURR')

Distribution of Target¶

We see a very unbalanced target variable. Clearly, most people repaid their loans. This is exactly why we intend to use the F-1 Score, which can handle this imbalance as a scoring metric.

In [8]:
import matplotlib.pyplot as plt
%matplotlib inline
train = datasets['application_train']
plt.figure(figsize=(10,6))
train["TARGET"].astype(int).plot.hist()
plt.title("Distribution of Target")
plt.xlabel("Target Class - 0: the loan was repaid / 1: the loan was not repaid")
Out[8]:
Text(0.5, 0, 'Target Class - 0: the loan was repaid / 1: the loan was not repaid')

Feature Selection for Application_training data¶

Print Info and Description Summaries of Files¶

In [9]:
def print_info(df):
    print ("INFO:")
    print(datasets[df].info(verbose=True, null_counts=True))
    print()
    print("DATA DESCRIPTION: ")
    print(datasets[df].describe())
In [10]:
for file_name in datasets.keys(): 
    print(f"File: {file_name}".upper())
    print("--------------------------")
    print_info(file_name)
    print()
    print("*************************************************************************************")
    print()
FILE: CREDIT_CARD_BALANCE
--------------------------
INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3840312 entries, 0 to 3840311
Data columns (total 23 columns):
 #   Column                      Non-Null Count    Dtype  
---  ------                      --------------    -----  
 0   SK_ID_PREV                  3840312 non-null  int64  
 1   SK_ID_CURR                  3840312 non-null  int64  
 2   MONTHS_BALANCE              3840312 non-null  int64  
 3   AMT_BALANCE                 3840312 non-null  float64
 4   AMT_CREDIT_LIMIT_ACTUAL     3840312 non-null  int64  
 5   AMT_DRAWINGS_ATM_CURRENT    3090496 non-null  float64
 6   AMT_DRAWINGS_CURRENT        3840312 non-null  float64
 7   AMT_DRAWINGS_OTHER_CURRENT  3090496 non-null  float64
 8   AMT_DRAWINGS_POS_CURRENT    3090496 non-null  float64
 9   AMT_INST_MIN_REGULARITY     3535076 non-null  float64
 10  AMT_PAYMENT_CURRENT         3072324 non-null  float64
 11  AMT_PAYMENT_TOTAL_CURRENT   3840312 non-null  float64
 12  AMT_RECEIVABLE_PRINCIPAL    3840312 non-null  float64
 13  AMT_RECIVABLE               3840312 non-null  float64
 14  AMT_TOTAL_RECEIVABLE        3840312 non-null  float64
 15  CNT_DRAWINGS_ATM_CURRENT    3090496 non-null  float64
 16  CNT_DRAWINGS_CURRENT        3840312 non-null  int64  
 17  CNT_DRAWINGS_OTHER_CURRENT  3090496 non-null  float64
 18  CNT_DRAWINGS_POS_CURRENT    3090496 non-null  float64
 19  CNT_INSTALMENT_MATURE_CUM   3535076 non-null  float64
 20  NAME_CONTRACT_STATUS        3840312 non-null  object 
 21  SK_DPD                      3840312 non-null  int64  
 22  SK_DPD_DEF                  3840312 non-null  int64  
dtypes: float64(15), int64(7), object(1)
memory usage: 673.9+ MB
None

DATA DESCRIPTION: 
         SK_ID_PREV    SK_ID_CURR  MONTHS_BALANCE   AMT_BALANCE  \
count  3.840312e+06  3.840312e+06    3.840312e+06  3.840312e+06   
mean   1.904504e+06  2.783242e+05   -3.452192e+01  5.830016e+04   
std    5.364695e+05  1.027045e+05    2.666775e+01  1.063070e+05   
min    1.000018e+06  1.000060e+05   -9.600000e+01 -4.202502e+05   
25%    1.434385e+06  1.895170e+05   -5.500000e+01  0.000000e+00   
50%    1.897122e+06  2.783960e+05   -2.800000e+01  0.000000e+00   
75%    2.369328e+06  3.675800e+05   -1.100000e+01  8.904669e+04   
max    2.843496e+06  4.562500e+05   -1.000000e+00  1.505902e+06   

       AMT_CREDIT_LIMIT_ACTUAL  AMT_DRAWINGS_ATM_CURRENT  \
count             3.840312e+06              3.090496e+06   
mean              1.538080e+05              5.961325e+03   
std               1.651457e+05              2.822569e+04   
min               0.000000e+00             -6.827310e+03   
25%               4.500000e+04              0.000000e+00   
50%               1.125000e+05              0.000000e+00   
75%               1.800000e+05              0.000000e+00   
max               1.350000e+06              2.115000e+06   

       AMT_DRAWINGS_CURRENT  AMT_DRAWINGS_OTHER_CURRENT  \
count          3.840312e+06                3.090496e+06   
mean           7.433388e+03                2.881696e+02   
std            3.384608e+04                8.201989e+03   
min           -6.211620e+03                0.000000e+00   
25%            0.000000e+00                0.000000e+00   
50%            0.000000e+00                0.000000e+00   
75%            0.000000e+00                0.000000e+00   
max            2.287098e+06                1.529847e+06   

       AMT_DRAWINGS_POS_CURRENT  AMT_INST_MIN_REGULARITY  ...  \
count              3.090496e+06             3.535076e+06  ...   
mean               2.968805e+03             3.540204e+03  ...   
std                2.079689e+04             5.600154e+03  ...   
min                0.000000e+00             0.000000e+00  ...   
25%                0.000000e+00             0.000000e+00  ...   
50%                0.000000e+00             0.000000e+00  ...   
75%                0.000000e+00             6.633911e+03  ...   
max                2.239274e+06             2.028820e+05  ...   

       AMT_RECEIVABLE_PRINCIPAL  AMT_RECIVABLE  AMT_TOTAL_RECEIVABLE  \
count              3.840312e+06   3.840312e+06          3.840312e+06   
mean               5.596588e+04   5.808881e+04          5.809829e+04   
std                1.025336e+05   1.059654e+05          1.059718e+05   
min               -4.233058e+05  -4.202502e+05         -4.202502e+05   
25%                0.000000e+00   0.000000e+00          0.000000e+00   
50%                0.000000e+00   0.000000e+00          0.000000e+00   
75%                8.535924e+04   8.889949e+04          8.891451e+04   
max                1.472317e+06   1.493338e+06          1.493338e+06   

       CNT_DRAWINGS_ATM_CURRENT  CNT_DRAWINGS_CURRENT  \
count              3.090496e+06          3.840312e+06   
mean               3.094490e-01          7.031439e-01   
std                1.100401e+00          3.190347e+00   
min                0.000000e+00          0.000000e+00   
25%                0.000000e+00          0.000000e+00   
50%                0.000000e+00          0.000000e+00   
75%                0.000000e+00          0.000000e+00   
max                5.100000e+01          1.650000e+02   

       CNT_DRAWINGS_OTHER_CURRENT  CNT_DRAWINGS_POS_CURRENT  \
count                3.090496e+06              3.090496e+06   
mean                 4.812496e-03              5.594791e-01   
std                  8.263861e-02              3.240649e+00   
min                  0.000000e+00              0.000000e+00   
25%                  0.000000e+00              0.000000e+00   
50%                  0.000000e+00              0.000000e+00   
75%                  0.000000e+00              0.000000e+00   
max                  1.200000e+01              1.650000e+02   

       CNT_INSTALMENT_MATURE_CUM        SK_DPD    SK_DPD_DEF  
count               3.535076e+06  3.840312e+06  3.840312e+06  
mean                2.082508e+01  9.283667e+00  3.316220e-01  
std                 2.005149e+01  9.751570e+01  2.147923e+01  
min                 0.000000e+00  0.000000e+00  0.000000e+00  
25%                 4.000000e+00  0.000000e+00  0.000000e+00  
50%                 1.500000e+01  0.000000e+00  0.000000e+00  
75%                 3.200000e+01  0.000000e+00  0.000000e+00  
max                 1.200000e+02  3.260000e+03  3.260000e+03  

[8 rows x 22 columns]

*************************************************************************************

FILE: INSTALLMENTS_PAYMENTS
--------------------------
INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13605401 entries, 0 to 13605400
Data columns (total 8 columns):
 #   Column                  Non-Null Count     Dtype  
---  ------                  --------------     -----  
 0   SK_ID_PREV              13605401 non-null  int64  
 1   SK_ID_CURR              13605401 non-null  int64  
 2   NUM_INSTALMENT_VERSION  13605401 non-null  float64
 3   NUM_INSTALMENT_NUMBER   13605401 non-null  int64  
 4   DAYS_INSTALMENT         13605401 non-null  float64
 5   DAYS_ENTRY_PAYMENT      13602496 non-null  float64
 6   AMT_INSTALMENT          13605401 non-null  float64
 7   AMT_PAYMENT             13602496 non-null  float64
dtypes: float64(5), int64(3)
memory usage: 830.4 MB
None

DATA DESCRIPTION: 
         SK_ID_PREV    SK_ID_CURR  NUM_INSTALMENT_VERSION  \
count  1.360540e+07  1.360540e+07            1.360540e+07   
mean   1.903365e+06  2.784449e+05            8.566373e-01   
std    5.362029e+05  1.027183e+05            1.035216e+00   
min    1.000001e+06  1.000010e+05            0.000000e+00   
25%    1.434191e+06  1.896390e+05            0.000000e+00   
50%    1.896520e+06  2.786850e+05            1.000000e+00   
75%    2.369094e+06  3.675300e+05            1.000000e+00   
max    2.843499e+06  4.562550e+05            1.780000e+02   

       NUM_INSTALMENT_NUMBER  DAYS_INSTALMENT  DAYS_ENTRY_PAYMENT  \
count           1.360540e+07     1.360540e+07        1.360250e+07   
mean            1.887090e+01    -1.042270e+03       -1.051114e+03   
std             2.666407e+01     8.009463e+02        8.005859e+02   
min             1.000000e+00    -2.922000e+03       -4.921000e+03   
25%             4.000000e+00    -1.654000e+03       -1.662000e+03   
50%             8.000000e+00    -8.180000e+02       -8.270000e+02   
75%             1.900000e+01    -3.610000e+02       -3.700000e+02   
max             2.770000e+02    -1.000000e+00       -1.000000e+00   

       AMT_INSTALMENT   AMT_PAYMENT  
count    1.360540e+07  1.360250e+07  
mean     1.705091e+04  1.723822e+04  
std      5.057025e+04  5.473578e+04  
min      0.000000e+00  0.000000e+00  
25%      4.226085e+03  3.398265e+03  
50%      8.884080e+03  8.125515e+03  
75%      1.671021e+04  1.610842e+04  
max      3.771488e+06  3.771488e+06  

*************************************************************************************

FILE: BUREAU_BALANCE
--------------------------
INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27299925 entries, 0 to 27299924
Data columns (total 3 columns):
 #   Column          Non-Null Count     Dtype 
---  ------          --------------     ----- 
 0   SK_ID_BUREAU    27299925 non-null  int64 
 1   MONTHS_BALANCE  27299925 non-null  int64 
 2   STATUS          27299925 non-null  object
dtypes: int64(2), object(1)
memory usage: 624.8+ MB
None

DATA DESCRIPTION: 
       SK_ID_BUREAU  MONTHS_BALANCE
count  2.729992e+07    2.729992e+07
mean   6.036297e+06   -3.074169e+01
std    4.923489e+05    2.386451e+01
min    5.001709e+06   -9.600000e+01
25%    5.730933e+06   -4.600000e+01
50%    6.070821e+06   -2.500000e+01
75%    6.431951e+06   -1.100000e+01
max    6.842888e+06    0.000000e+00

*************************************************************************************

FILE: APPLICATION_TRAIN
--------------------------
INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 122 columns):
 #    Column                        Non-Null Count   Dtype  
---   ------                        --------------   -----  
 0    SK_ID_CURR                    307511 non-null  int64  
 1    TARGET                        307511 non-null  int64  
 2    NAME_CONTRACT_TYPE            307511 non-null  object 
 3    CODE_GENDER                   307511 non-null  object 
 4    FLAG_OWN_CAR                  307511 non-null  object 
 5    FLAG_OWN_REALTY               307511 non-null  object 
 6    CNT_CHILDREN                  307511 non-null  int64  
 7    AMT_INCOME_TOTAL              307511 non-null  float64
 8    AMT_CREDIT                    307511 non-null  float64
 9    AMT_ANNUITY                   307499 non-null  float64
 10   AMT_GOODS_PRICE               307233 non-null  float64
 11   NAME_TYPE_SUITE               306219 non-null  object 
 12   NAME_INCOME_TYPE              307511 non-null  object 
 13   NAME_EDUCATION_TYPE           307511 non-null  object 
 14   NAME_FAMILY_STATUS            307511 non-null  object 
 15   NAME_HOUSING_TYPE             307511 non-null  object 
 16   REGION_POPULATION_RELATIVE    307511 non-null  float64
 17   DAYS_BIRTH                    307511 non-null  int64  
 18   DAYS_EMPLOYED                 307511 non-null  int64  
 19   DAYS_REGISTRATION             307511 non-null  float64
 20   DAYS_ID_PUBLISH               307511 non-null  int64  
 21   OWN_CAR_AGE                   104582 non-null  float64
 22   FLAG_MOBIL                    307511 non-null  int64  
 23   FLAG_EMP_PHONE                307511 non-null  int64  
 24   FLAG_WORK_PHONE               307511 non-null  int64  
 25   FLAG_CONT_MOBILE              307511 non-null  int64  
 26   FLAG_PHONE                    307511 non-null  int64  
 27   FLAG_EMAIL                    307511 non-null  int64  
 28   OCCUPATION_TYPE               211120 non-null  object 
 29   CNT_FAM_MEMBERS               307509 non-null  float64
 30   REGION_RATING_CLIENT          307511 non-null  int64  
 31   REGION_RATING_CLIENT_W_CITY   307511 non-null  int64  
 32   WEEKDAY_APPR_PROCESS_START    307511 non-null  object 
 33   HOUR_APPR_PROCESS_START       307511 non-null  int64  
 34   REG_REGION_NOT_LIVE_REGION    307511 non-null  int64  
 35   REG_REGION_NOT_WORK_REGION    307511 non-null  int64  
 36   LIVE_REGION_NOT_WORK_REGION   307511 non-null  int64  
 37   REG_CITY_NOT_LIVE_CITY        307511 non-null  int64  
 38   REG_CITY_NOT_WORK_CITY        307511 non-null  int64  
 39   LIVE_CITY_NOT_WORK_CITY       307511 non-null  int64  
 40   ORGANIZATION_TYPE             307511 non-null  object 
 41   EXT_SOURCE_1                  134133 non-null  float64
 42   EXT_SOURCE_2                  306851 non-null  float64
 43   EXT_SOURCE_3                  246546 non-null  float64
 44   APARTMENTS_AVG                151450 non-null  float64
 45   BASEMENTAREA_AVG              127568 non-null  float64
 46   YEARS_BEGINEXPLUATATION_AVG   157504 non-null  float64
 47   YEARS_BUILD_AVG               103023 non-null  float64
 48   COMMONAREA_AVG                92646 non-null   float64
 49   ELEVATORS_AVG                 143620 non-null  float64
 50   ENTRANCES_AVG                 152683 non-null  float64
 51   FLOORSMAX_AVG                 154491 non-null  float64
 52   FLOORSMIN_AVG                 98869 non-null   float64
 53   LANDAREA_AVG                  124921 non-null  float64
 54   LIVINGAPARTMENTS_AVG          97312 non-null   float64
 55   LIVINGAREA_AVG                153161 non-null  float64
 56   NONLIVINGAPARTMENTS_AVG       93997 non-null   float64
 57   NONLIVINGAREA_AVG             137829 non-null  float64
 58   APARTMENTS_MODE               151450 non-null  float64
 59   BASEMENTAREA_MODE             127568 non-null  float64
 60   YEARS_BEGINEXPLUATATION_MODE  157504 non-null  float64
 61   YEARS_BUILD_MODE              103023 non-null  float64
 62   COMMONAREA_MODE               92646 non-null   float64
 63   ELEVATORS_MODE                143620 non-null  float64
 64   ENTRANCES_MODE                152683 non-null  float64
 65   FLOORSMAX_MODE                154491 non-null  float64
 66   FLOORSMIN_MODE                98869 non-null   float64
 67   LANDAREA_MODE                 124921 non-null  float64
 68   LIVINGAPARTMENTS_MODE         97312 non-null   float64
 69   LIVINGAREA_MODE               153161 non-null  float64
 70   NONLIVINGAPARTMENTS_MODE      93997 non-null   float64
 71   NONLIVINGAREA_MODE            137829 non-null  float64
 72   APARTMENTS_MEDI               151450 non-null  float64
 73   BASEMENTAREA_MEDI             127568 non-null  float64
 74   YEARS_BEGINEXPLUATATION_MEDI  157504 non-null  float64
 75   YEARS_BUILD_MEDI              103023 non-null  float64
 76   COMMONAREA_MEDI               92646 non-null   float64
 77   ELEVATORS_MEDI                143620 non-null  float64
 78   ENTRANCES_MEDI                152683 non-null  float64
 79   FLOORSMAX_MEDI                154491 non-null  float64
 80   FLOORSMIN_MEDI                98869 non-null   float64
 81   LANDAREA_MEDI                 124921 non-null  float64
 82   LIVINGAPARTMENTS_MEDI         97312 non-null   float64
 83   LIVINGAREA_MEDI               153161 non-null  float64
 84   NONLIVINGAPARTMENTS_MEDI      93997 non-null   float64
 85   NONLIVINGAREA_MEDI            137829 non-null  float64
 86   FONDKAPREMONT_MODE            97216 non-null   object 
 87   HOUSETYPE_MODE                153214 non-null  object 
 88   TOTALAREA_MODE                159080 non-null  float64
 89   WALLSMATERIAL_MODE            151170 non-null  object 
 90   EMERGENCYSTATE_MODE           161756 non-null  object 
 91   OBS_30_CNT_SOCIAL_CIRCLE      306490 non-null  float64
 92   DEF_30_CNT_SOCIAL_CIRCLE      306490 non-null  float64
 93   OBS_60_CNT_SOCIAL_CIRCLE      306490 non-null  float64
 94   DEF_60_CNT_SOCIAL_CIRCLE      306490 non-null  float64
 95   DAYS_LAST_PHONE_CHANGE        307510 non-null  float64
 96   FLAG_DOCUMENT_2               307511 non-null  int64  
 97   FLAG_DOCUMENT_3               307511 non-null  int64  
 98   FLAG_DOCUMENT_4               307511 non-null  int64  
 99   FLAG_DOCUMENT_5               307511 non-null  int64  
 100  FLAG_DOCUMENT_6               307511 non-null  int64  
 101  FLAG_DOCUMENT_7               307511 non-null  int64  
 102  FLAG_DOCUMENT_8               307511 non-null  int64  
 103  FLAG_DOCUMENT_9               307511 non-null  int64  
 104  FLAG_DOCUMENT_10              307511 non-null  int64  
 105  FLAG_DOCUMENT_11              307511 non-null  int64  
 106  FLAG_DOCUMENT_12              307511 non-null  int64  
 107  FLAG_DOCUMENT_13              307511 non-null  int64  
 108  FLAG_DOCUMENT_14              307511 non-null  int64  
 109  FLAG_DOCUMENT_15              307511 non-null  int64  
 110  FLAG_DOCUMENT_16              307511 non-null  int64  
 111  FLAG_DOCUMENT_17              307511 non-null  int64  
 112  FLAG_DOCUMENT_18              307511 non-null  int64  
 113  FLAG_DOCUMENT_19              307511 non-null  int64  
 114  FLAG_DOCUMENT_20              307511 non-null  int64  
 115  FLAG_DOCUMENT_21              307511 non-null  int64  
 116  AMT_REQ_CREDIT_BUREAU_HOUR    265992 non-null  float64
 117  AMT_REQ_CREDIT_BUREAU_DAY     265992 non-null  float64
 118  AMT_REQ_CREDIT_BUREAU_WEEK    265992 non-null  float64
 119  AMT_REQ_CREDIT_BUREAU_MON     265992 non-null  float64
 120  AMT_REQ_CREDIT_BUREAU_QRT     265992 non-null  float64
 121  AMT_REQ_CREDIT_BUREAU_YEAR    265992 non-null  float64
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB
None

DATA DESCRIPTION: 
          SK_ID_CURR         TARGET   CNT_CHILDREN  AMT_INCOME_TOTAL  \
count  307511.000000  307511.000000  307511.000000      3.075110e+05   
mean   278180.518577       0.080729       0.417052      1.687979e+05   
std    102790.175348       0.272419       0.722121      2.371231e+05   
min    100002.000000       0.000000       0.000000      2.565000e+04   
25%    189145.500000       0.000000       0.000000      1.125000e+05   
50%    278202.000000       0.000000       0.000000      1.471500e+05   
75%    367142.500000       0.000000       1.000000      2.025000e+05   
max    456255.000000       1.000000      19.000000      1.170000e+08   

         AMT_CREDIT    AMT_ANNUITY  AMT_GOODS_PRICE  \
count  3.075110e+05  307499.000000     3.072330e+05   
mean   5.990260e+05   27108.573909     5.383962e+05   
std    4.024908e+05   14493.737315     3.694465e+05   
min    4.500000e+04    1615.500000     4.050000e+04   
25%    2.700000e+05   16524.000000     2.385000e+05   
50%    5.135310e+05   24903.000000     4.500000e+05   
75%    8.086500e+05   34596.000000     6.795000e+05   
max    4.050000e+06  258025.500000     4.050000e+06   

       REGION_POPULATION_RELATIVE     DAYS_BIRTH  DAYS_EMPLOYED  ...  \
count               307511.000000  307511.000000  307511.000000  ...   
mean                     0.020868  -16036.995067   63815.045904  ...   
std                      0.013831    4363.988632  141275.766519  ...   
min                      0.000290  -25229.000000  -17912.000000  ...   
25%                      0.010006  -19682.000000   -2760.000000  ...   
50%                      0.018850  -15750.000000   -1213.000000  ...   
75%                      0.028663  -12413.000000    -289.000000  ...   
max                      0.072508   -7489.000000  365243.000000  ...   

       FLAG_DOCUMENT_18  FLAG_DOCUMENT_19  FLAG_DOCUMENT_20  FLAG_DOCUMENT_21  \
count     307511.000000     307511.000000     307511.000000     307511.000000   
mean           0.008130          0.000595          0.000507          0.000335   
std            0.089798          0.024387          0.022518          0.018299   
min            0.000000          0.000000          0.000000          0.000000   
25%            0.000000          0.000000          0.000000          0.000000   
50%            0.000000          0.000000          0.000000          0.000000   
75%            0.000000          0.000000          0.000000          0.000000   
max            1.000000          1.000000          1.000000          1.000000   

       AMT_REQ_CREDIT_BUREAU_HOUR  AMT_REQ_CREDIT_BUREAU_DAY  \
count               265992.000000              265992.000000   
mean                     0.006402                   0.007000   
std                      0.083849                   0.110757   
min                      0.000000                   0.000000   
25%                      0.000000                   0.000000   
50%                      0.000000                   0.000000   
75%                      0.000000                   0.000000   
max                      4.000000                   9.000000   

       AMT_REQ_CREDIT_BUREAU_WEEK  AMT_REQ_CREDIT_BUREAU_MON  \
count               265992.000000              265992.000000   
mean                     0.034362                   0.267395   
std                      0.204685                   0.916002   
min                      0.000000                   0.000000   
25%                      0.000000                   0.000000   
50%                      0.000000                   0.000000   
75%                      0.000000                   0.000000   
max                      8.000000                  27.000000   

       AMT_REQ_CREDIT_BUREAU_QRT  AMT_REQ_CREDIT_BUREAU_YEAR  
count              265992.000000               265992.000000  
mean                    0.265474                    1.899974  
std                     0.794056                    1.869295  
min                     0.000000                    0.000000  
25%                     0.000000                    0.000000  
50%                     0.000000                    1.000000  
75%                     0.000000                    3.000000  
max                   261.000000                   25.000000  

[8 rows x 106 columns]

*************************************************************************************

FILE: POS_CASH_BALANCE
--------------------------
INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10001358 entries, 0 to 10001357
Data columns (total 8 columns):
 #   Column                 Non-Null Count     Dtype  
---  ------                 --------------     -----  
 0   SK_ID_PREV             10001358 non-null  int64  
 1   SK_ID_CURR             10001358 non-null  int64  
 2   MONTHS_BALANCE         10001358 non-null  int64  
 3   CNT_INSTALMENT         9975287 non-null   float64
 4   CNT_INSTALMENT_FUTURE  9975271 non-null   float64
 5   NAME_CONTRACT_STATUS   10001358 non-null  object 
 6   SK_DPD                 10001358 non-null  int64  
 7   SK_DPD_DEF             10001358 non-null  int64  
dtypes: float64(2), int64(5), object(1)
memory usage: 610.4+ MB
None

DATA DESCRIPTION: 
         SK_ID_PREV    SK_ID_CURR  MONTHS_BALANCE  CNT_INSTALMENT  \
count  1.000136e+07  1.000136e+07    1.000136e+07    9.975287e+06   
mean   1.903217e+06  2.784039e+05   -3.501259e+01    1.708965e+01   
std    5.358465e+05  1.027637e+05    2.606657e+01    1.199506e+01   
min    1.000001e+06  1.000010e+05   -9.600000e+01    1.000000e+00   
25%    1.434405e+06  1.895500e+05   -5.400000e+01    1.000000e+01   
50%    1.896565e+06  2.786540e+05   -2.800000e+01    1.200000e+01   
75%    2.368963e+06  3.674290e+05   -1.300000e+01    2.400000e+01   
max    2.843499e+06  4.562550e+05   -1.000000e+00    9.200000e+01   

       CNT_INSTALMENT_FUTURE        SK_DPD    SK_DPD_DEF  
count           9.975271e+06  1.000136e+07  1.000136e+07  
mean            1.048384e+01  1.160693e+01  6.544684e-01  
std             1.110906e+01  1.327140e+02  3.276249e+01  
min             0.000000e+00  0.000000e+00  0.000000e+00  
25%             3.000000e+00  0.000000e+00  0.000000e+00  
50%             7.000000e+00  0.000000e+00  0.000000e+00  
75%             1.400000e+01  0.000000e+00  0.000000e+00  
max             8.500000e+01  4.231000e+03  3.595000e+03  

*************************************************************************************

FILE: APPLICATION_TEST
--------------------------
INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Data columns (total 121 columns):
 #    Column                        Non-Null Count  Dtype  
---   ------                        --------------  -----  
 0    SK_ID_CURR                    48744 non-null  int64  
 1    NAME_CONTRACT_TYPE            48744 non-null  object 
 2    CODE_GENDER                   48744 non-null  object 
 3    FLAG_OWN_CAR                  48744 non-null  object 
 4    FLAG_OWN_REALTY               48744 non-null  object 
 5    CNT_CHILDREN                  48744 non-null  int64  
 6    AMT_INCOME_TOTAL              48744 non-null  float64
 7    AMT_CREDIT                    48744 non-null  float64
 8    AMT_ANNUITY                   48720 non-null  float64
 9    AMT_GOODS_PRICE               48744 non-null  float64
 10   NAME_TYPE_SUITE               47833 non-null  object 
 11   NAME_INCOME_TYPE              48744 non-null  object 
 12   NAME_EDUCATION_TYPE           48744 non-null  object 
 13   NAME_FAMILY_STATUS            48744 non-null  object 
 14   NAME_HOUSING_TYPE             48744 non-null  object 
 15   REGION_POPULATION_RELATIVE    48744 non-null  float64
 16   DAYS_BIRTH                    48744 non-null  int64  
 17   DAYS_EMPLOYED                 48744 non-null  int64  
 18   DAYS_REGISTRATION             48744 non-null  float64
 19   DAYS_ID_PUBLISH               48744 non-null  int64  
 20   OWN_CAR_AGE                   16432 non-null  float64
 21   FLAG_MOBIL                    48744 non-null  int64  
 22   FLAG_EMP_PHONE                48744 non-null  int64  
 23   FLAG_WORK_PHONE               48744 non-null  int64  
 24   FLAG_CONT_MOBILE              48744 non-null  int64  
 25   FLAG_PHONE                    48744 non-null  int64  
 26   FLAG_EMAIL                    48744 non-null  int64  
 27   OCCUPATION_TYPE               33139 non-null  object 
 28   CNT_FAM_MEMBERS               48744 non-null  float64
 29   REGION_RATING_CLIENT          48744 non-null  int64  
 30   REGION_RATING_CLIENT_W_CITY   48744 non-null  int64  
 31   WEEKDAY_APPR_PROCESS_START    48744 non-null  object 
 32   HOUR_APPR_PROCESS_START       48744 non-null  int64  
 33   REG_REGION_NOT_LIVE_REGION    48744 non-null  int64  
 34   REG_REGION_NOT_WORK_REGION    48744 non-null  int64  
 35   LIVE_REGION_NOT_WORK_REGION   48744 non-null  int64  
 36   REG_CITY_NOT_LIVE_CITY        48744 non-null  int64  
 37   REG_CITY_NOT_WORK_CITY        48744 non-null  int64  
 38   LIVE_CITY_NOT_WORK_CITY       48744 non-null  int64  
 39   ORGANIZATION_TYPE             48744 non-null  object 
 40   EXT_SOURCE_1                  28212 non-null  float64
 41   EXT_SOURCE_2                  48736 non-null  float64
 42   EXT_SOURCE_3                  40076 non-null  float64
 43   APARTMENTS_AVG                24857 non-null  float64
 44   BASEMENTAREA_AVG              21103 non-null  float64
 45   YEARS_BEGINEXPLUATATION_AVG   25888 non-null  float64
 46   YEARS_BUILD_AVG               16926 non-null  float64
 47   COMMONAREA_AVG                15249 non-null  float64
 48   ELEVATORS_AVG                 23555 non-null  float64
 49   ENTRANCES_AVG                 25165 non-null  float64
 50   FLOORSMAX_AVG                 25423 non-null  float64
 51   FLOORSMIN_AVG                 16278 non-null  float64
 52   LANDAREA_AVG                  20490 non-null  float64
 53   LIVINGAPARTMENTS_AVG          15964 non-null  float64
 54   LIVINGAREA_AVG                25192 non-null  float64
 55   NONLIVINGAPARTMENTS_AVG       15397 non-null  float64
 56   NONLIVINGAREA_AVG             22660 non-null  float64
 57   APARTMENTS_MODE               24857 non-null  float64
 58   BASEMENTAREA_MODE             21103 non-null  float64
 59   YEARS_BEGINEXPLUATATION_MODE  25888 non-null  float64
 60   YEARS_BUILD_MODE              16926 non-null  float64
 61   COMMONAREA_MODE               15249 non-null  float64
 62   ELEVATORS_MODE                23555 non-null  float64
 63   ENTRANCES_MODE                25165 non-null  float64
 64   FLOORSMAX_MODE                25423 non-null  float64
 65   FLOORSMIN_MODE                16278 non-null  float64
 66   LANDAREA_MODE                 20490 non-null  float64
 67   LIVINGAPARTMENTS_MODE         15964 non-null  float64
 68   LIVINGAREA_MODE               25192 non-null  float64
 69   NONLIVINGAPARTMENTS_MODE      15397 non-null  float64
 70   NONLIVINGAREA_MODE            22660 non-null  float64
 71   APARTMENTS_MEDI               24857 non-null  float64
 72   BASEMENTAREA_MEDI             21103 non-null  float64
 73   YEARS_BEGINEXPLUATATION_MEDI  25888 non-null  float64
 74   YEARS_BUILD_MEDI              16926 non-null  float64
 75   COMMONAREA_MEDI               15249 non-null  float64
 76   ELEVATORS_MEDI                23555 non-null  float64
 77   ENTRANCES_MEDI                25165 non-null  float64
 78   FLOORSMAX_MEDI                25423 non-null  float64
 79   FLOORSMIN_MEDI                16278 non-null  float64
 80   LANDAREA_MEDI                 20490 non-null  float64
 81   LIVINGAPARTMENTS_MEDI         15964 non-null  float64
 82   LIVINGAREA_MEDI               25192 non-null  float64
 83   NONLIVINGAPARTMENTS_MEDI      15397 non-null  float64
 84   NONLIVINGAREA_MEDI            22660 non-null  float64
 85   FONDKAPREMONT_MODE            15947 non-null  object 
 86   HOUSETYPE_MODE                25125 non-null  object 
 87   TOTALAREA_MODE                26120 non-null  float64
 88   WALLSMATERIAL_MODE            24851 non-null  object 
 89   EMERGENCYSTATE_MODE           26535 non-null  object 
 90   OBS_30_CNT_SOCIAL_CIRCLE      48715 non-null  float64
 91   DEF_30_CNT_SOCIAL_CIRCLE      48715 non-null  float64
 92   OBS_60_CNT_SOCIAL_CIRCLE      48715 non-null  float64
 93   DEF_60_CNT_SOCIAL_CIRCLE      48715 non-null  float64
 94   DAYS_LAST_PHONE_CHANGE        48744 non-null  float64
 95   FLAG_DOCUMENT_2               48744 non-null  int64  
 96   FLAG_DOCUMENT_3               48744 non-null  int64  
 97   FLAG_DOCUMENT_4               48744 non-null  int64  
 98   FLAG_DOCUMENT_5               48744 non-null  int64  
 99   FLAG_DOCUMENT_6               48744 non-null  int64  
 100  FLAG_DOCUMENT_7               48744 non-null  int64  
 101  FLAG_DOCUMENT_8               48744 non-null  int64  
 102  FLAG_DOCUMENT_9               48744 non-null  int64  
 103  FLAG_DOCUMENT_10              48744 non-null  int64  
 104  FLAG_DOCUMENT_11              48744 non-null  int64  
 105  FLAG_DOCUMENT_12              48744 non-null  int64  
 106  FLAG_DOCUMENT_13              48744 non-null  int64  
 107  FLAG_DOCUMENT_14              48744 non-null  int64  
 108  FLAG_DOCUMENT_15              48744 non-null  int64  
 109  FLAG_DOCUMENT_16              48744 non-null  int64  
 110  FLAG_DOCUMENT_17              48744 non-null  int64  
 111  FLAG_DOCUMENT_18              48744 non-null  int64  
 112  FLAG_DOCUMENT_19              48744 non-null  int64  
 113  FLAG_DOCUMENT_20              48744 non-null  int64  
 114  FLAG_DOCUMENT_21              48744 non-null  int64  
 115  AMT_REQ_CREDIT_BUREAU_HOUR    42695 non-null  float64
 116  AMT_REQ_CREDIT_BUREAU_DAY     42695 non-null  float64
 117  AMT_REQ_CREDIT_BUREAU_WEEK    42695 non-null  float64
 118  AMT_REQ_CREDIT_BUREAU_MON     42695 non-null  float64
 119  AMT_REQ_CREDIT_BUREAU_QRT     42695 non-null  float64
 120  AMT_REQ_CREDIT_BUREAU_YEAR    42695 non-null  float64
dtypes: float64(65), int64(40), object(16)
memory usage: 45.0+ MB
None

DATA DESCRIPTION: 
          SK_ID_CURR  CNT_CHILDREN  AMT_INCOME_TOTAL    AMT_CREDIT  \
count   48744.000000  48744.000000      4.874400e+04  4.874400e+04   
mean   277796.676350      0.397054      1.784318e+05  5.167404e+05   
std    103169.547296      0.709047      1.015226e+05  3.653970e+05   
min    100001.000000      0.000000      2.694150e+04  4.500000e+04   
25%    188557.750000      0.000000      1.125000e+05  2.606400e+05   
50%    277549.000000      0.000000      1.575000e+05  4.500000e+05   
75%    367555.500000      1.000000      2.250000e+05  6.750000e+05   
max    456250.000000     20.000000      4.410000e+06  2.245500e+06   

         AMT_ANNUITY  AMT_GOODS_PRICE  REGION_POPULATION_RELATIVE  \
count   48720.000000     4.874400e+04                48744.000000   
mean    29426.240209     4.626188e+05                    0.021226   
std     16016.368315     3.367102e+05                    0.014428   
min      2295.000000     4.500000e+04                    0.000253   
25%     17973.000000     2.250000e+05                    0.010006   
50%     26199.000000     3.960000e+05                    0.018850   
75%     37390.500000     6.300000e+05                    0.028663   
max    180576.000000     2.245500e+06                    0.072508   

         DAYS_BIRTH  DAYS_EMPLOYED  DAYS_REGISTRATION  ...  FLAG_DOCUMENT_18  \
count  48744.000000   48744.000000       48744.000000  ...      48744.000000   
mean  -16068.084605   67485.366322       -4967.652716  ...          0.001559   
std     4325.900393  144348.507136        3552.612035  ...          0.039456   
min   -25195.000000  -17463.000000      -23722.000000  ...          0.000000   
25%   -19637.000000   -2910.000000       -7459.250000  ...          0.000000   
50%   -15785.000000   -1293.000000       -4490.000000  ...          0.000000   
75%   -12496.000000    -296.000000       -1901.000000  ...          0.000000   
max    -7338.000000  365243.000000           0.000000  ...          1.000000   

       FLAG_DOCUMENT_19  FLAG_DOCUMENT_20  FLAG_DOCUMENT_21  \
count           48744.0           48744.0           48744.0   
mean                0.0               0.0               0.0   
std                 0.0               0.0               0.0   
min                 0.0               0.0               0.0   
25%                 0.0               0.0               0.0   
50%                 0.0               0.0               0.0   
75%                 0.0               0.0               0.0   
max                 0.0               0.0               0.0   

       AMT_REQ_CREDIT_BUREAU_HOUR  AMT_REQ_CREDIT_BUREAU_DAY  \
count                42695.000000               42695.000000   
mean                     0.002108                   0.001803   
std                      0.046373                   0.046132   
min                      0.000000                   0.000000   
25%                      0.000000                   0.000000   
50%                      0.000000                   0.000000   
75%                      0.000000                   0.000000   
max                      2.000000                   2.000000   

       AMT_REQ_CREDIT_BUREAU_WEEK  AMT_REQ_CREDIT_BUREAU_MON  \
count                42695.000000               42695.000000   
mean                     0.002787                   0.009299   
std                      0.054037                   0.110924   
min                      0.000000                   0.000000   
25%                      0.000000                   0.000000   
50%                      0.000000                   0.000000   
75%                      0.000000                   0.000000   
max                      2.000000                   6.000000   

       AMT_REQ_CREDIT_BUREAU_QRT  AMT_REQ_CREDIT_BUREAU_YEAR  
count               42695.000000                42695.000000  
mean                    0.546902                    1.983769  
std                     0.693305                    1.838873  
min                     0.000000                    0.000000  
25%                     0.000000                    0.000000  
50%                     0.000000                    2.000000  
75%                     1.000000                    3.000000  
max                     7.000000                   17.000000  

[8 rows x 105 columns]

*************************************************************************************

FILE: BUREAU
--------------------------
INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1716428 entries, 0 to 1716427
Data columns (total 17 columns):
 #   Column                  Non-Null Count    Dtype  
---  ------                  --------------    -----  
 0   SK_ID_CURR              1716428 non-null  int64  
 1   SK_ID_BUREAU            1716428 non-null  int64  
 2   CREDIT_ACTIVE           1716428 non-null  object 
 3   CREDIT_CURRENCY         1716428 non-null  object 
 4   DAYS_CREDIT             1716428 non-null  int64  
 5   CREDIT_DAY_OVERDUE      1716428 non-null  int64  
 6   DAYS_CREDIT_ENDDATE     1610875 non-null  float64
 7   DAYS_ENDDATE_FACT       1082775 non-null  float64
 8   AMT_CREDIT_MAX_OVERDUE  591940 non-null   float64
 9   CNT_CREDIT_PROLONG      1716428 non-null  int64  
 10  AMT_CREDIT_SUM          1716415 non-null  float64
 11  AMT_CREDIT_SUM_DEBT     1458759 non-null  float64
 12  AMT_CREDIT_SUM_LIMIT    1124648 non-null  float64
 13  AMT_CREDIT_SUM_OVERDUE  1716428 non-null  float64
 14  CREDIT_TYPE             1716428 non-null  object 
 15  DAYS_CREDIT_UPDATE      1716428 non-null  int64  
 16  AMT_ANNUITY             489637 non-null   float64
dtypes: float64(8), int64(6), object(3)
memory usage: 222.6+ MB
None

DATA DESCRIPTION: 
         SK_ID_CURR  SK_ID_BUREAU   DAYS_CREDIT  CREDIT_DAY_OVERDUE  \
count  1.716428e+06  1.716428e+06  1.716428e+06        1.716428e+06   
mean   2.782149e+05  5.924434e+06 -1.142108e+03        8.181666e-01   
std    1.029386e+05  5.322657e+05  7.951649e+02        3.654443e+01   
min    1.000010e+05  5.000000e+06 -2.922000e+03        0.000000e+00   
25%    1.888668e+05  5.463954e+06 -1.666000e+03        0.000000e+00   
50%    2.780550e+05  5.926304e+06 -9.870000e+02        0.000000e+00   
75%    3.674260e+05  6.385681e+06 -4.740000e+02        0.000000e+00   
max    4.562550e+05  6.843457e+06  0.000000e+00        2.792000e+03   

       DAYS_CREDIT_ENDDATE  DAYS_ENDDATE_FACT  AMT_CREDIT_MAX_OVERDUE  \
count         1.610875e+06       1.082775e+06            5.919400e+05   
mean          5.105174e+02      -1.017437e+03            3.825418e+03   
std           4.994220e+03       7.140106e+02            2.060316e+05   
min          -4.206000e+04      -4.202300e+04            0.000000e+00   
25%          -1.138000e+03      -1.489000e+03            0.000000e+00   
50%          -3.300000e+02      -8.970000e+02            0.000000e+00   
75%           4.740000e+02      -4.250000e+02            0.000000e+00   
max           3.119900e+04       0.000000e+00            1.159872e+08   

       CNT_CREDIT_PROLONG  AMT_CREDIT_SUM  AMT_CREDIT_SUM_DEBT  \
count        1.716428e+06    1.716415e+06         1.458759e+06   
mean         6.410406e-03    3.549946e+05         1.370851e+05   
std          9.622391e-02    1.149811e+06         6.774011e+05   
min          0.000000e+00    0.000000e+00        -4.705600e+06   
25%          0.000000e+00    5.130000e+04         0.000000e+00   
50%          0.000000e+00    1.255185e+05         0.000000e+00   
75%          0.000000e+00    3.150000e+05         4.015350e+04   
max          9.000000e+00    5.850000e+08         1.701000e+08   

       AMT_CREDIT_SUM_LIMIT  AMT_CREDIT_SUM_OVERDUE  DAYS_CREDIT_UPDATE  \
count          1.124648e+06            1.716428e+06        1.716428e+06   
mean           6.229515e+03            3.791276e+01       -5.937483e+02   
std            4.503203e+04            5.937650e+03        7.207473e+02   
min           -5.864061e+05            0.000000e+00       -4.194700e+04   
25%            0.000000e+00            0.000000e+00       -9.080000e+02   
50%            0.000000e+00            0.000000e+00       -3.950000e+02   
75%            0.000000e+00            0.000000e+00       -3.300000e+01   
max            4.705600e+06            3.756681e+06        3.720000e+02   

        AMT_ANNUITY  
count  4.896370e+05  
mean   1.571276e+04  
std    3.258269e+05  
min    0.000000e+00  
25%    0.000000e+00  
50%    0.000000e+00  
75%    1.350000e+04  
max    1.184534e+08  

*************************************************************************************

FILE: PREVIOUS_APPLICATION
--------------------------
INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1670214 entries, 0 to 1670213
Data columns (total 37 columns):
 #   Column                       Non-Null Count    Dtype  
---  ------                       --------------    -----  
 0   SK_ID_PREV                   1670214 non-null  int64  
 1   SK_ID_CURR                   1670214 non-null  int64  
 2   NAME_CONTRACT_TYPE           1670214 non-null  object 
 3   AMT_ANNUITY                  1297979 non-null  float64
 4   AMT_APPLICATION              1670214 non-null  float64
 5   AMT_CREDIT                   1670213 non-null  float64
 6   AMT_DOWN_PAYMENT             774370 non-null   float64
 7   AMT_GOODS_PRICE              1284699 non-null  float64
 8   WEEKDAY_APPR_PROCESS_START   1670214 non-null  object 
 9   HOUR_APPR_PROCESS_START      1670214 non-null  int64  
 10  FLAG_LAST_APPL_PER_CONTRACT  1670214 non-null  object 
 11  NFLAG_LAST_APPL_IN_DAY       1670214 non-null  int64  
 12  RATE_DOWN_PAYMENT            774370 non-null   float64
 13  RATE_INTEREST_PRIMARY        5951 non-null     float64
 14  RATE_INTEREST_PRIVILEGED     5951 non-null     float64
 15  NAME_CASH_LOAN_PURPOSE       1670214 non-null  object 
 16  NAME_CONTRACT_STATUS         1670214 non-null  object 
 17  DAYS_DECISION                1670214 non-null  int64  
 18  NAME_PAYMENT_TYPE            1670214 non-null  object 
 19  CODE_REJECT_REASON           1670214 non-null  object 
 20  NAME_TYPE_SUITE              849809 non-null   object 
 21  NAME_CLIENT_TYPE             1670214 non-null  object 
 22  NAME_GOODS_CATEGORY          1670214 non-null  object 
 23  NAME_PORTFOLIO               1670214 non-null  object 
 24  NAME_PRODUCT_TYPE            1670214 non-null  object 
 25  CHANNEL_TYPE                 1670214 non-null  object 
 26  SELLERPLACE_AREA             1670214 non-null  int64  
 27  NAME_SELLER_INDUSTRY         1670214 non-null  object 
 28  CNT_PAYMENT                  1297984 non-null  float64
 29  NAME_YIELD_GROUP             1670214 non-null  object 
 30  PRODUCT_COMBINATION          1669868 non-null  object 
 31  DAYS_FIRST_DRAWING           997149 non-null   float64
 32  DAYS_FIRST_DUE               997149 non-null   float64
 33  DAYS_LAST_DUE_1ST_VERSION    997149 non-null   float64
 34  DAYS_LAST_DUE                997149 non-null   float64
 35  DAYS_TERMINATION             997149 non-null   float64
 36  NFLAG_INSURED_ON_APPROVAL    997149 non-null   float64
dtypes: float64(15), int64(6), object(16)
memory usage: 471.5+ MB
None

DATA DESCRIPTION: 
         SK_ID_PREV    SK_ID_CURR   AMT_ANNUITY  AMT_APPLICATION  \
count  1.670214e+06  1.670214e+06  1.297979e+06     1.670214e+06   
mean   1.923089e+06  2.783572e+05  1.595512e+04     1.752339e+05   
std    5.325980e+05  1.028148e+05  1.478214e+04     2.927798e+05   
min    1.000001e+06  1.000010e+05  0.000000e+00     0.000000e+00   
25%    1.461857e+06  1.893290e+05  6.321780e+03     1.872000e+04   
50%    1.923110e+06  2.787145e+05  1.125000e+04     7.104600e+04   
75%    2.384280e+06  3.675140e+05  2.065842e+04     1.803600e+05   
max    2.845382e+06  4.562550e+05  4.180581e+05     6.905160e+06   

         AMT_CREDIT  AMT_DOWN_PAYMENT  AMT_GOODS_PRICE  \
count  1.670213e+06      7.743700e+05     1.284699e+06   
mean   1.961140e+05      6.697402e+03     2.278473e+05   
std    3.185746e+05      2.092150e+04     3.153966e+05   
min    0.000000e+00     -9.000000e-01     0.000000e+00   
25%    2.416050e+04      0.000000e+00     5.084100e+04   
50%    8.054100e+04      1.638000e+03     1.123200e+05   
75%    2.164185e+05      7.740000e+03     2.340000e+05   
max    6.905160e+06      3.060045e+06     6.905160e+06   

       HOUR_APPR_PROCESS_START  NFLAG_LAST_APPL_IN_DAY  RATE_DOWN_PAYMENT  \
count             1.670214e+06            1.670214e+06      774370.000000   
mean              1.248418e+01            9.964675e-01           0.079637   
std               3.334028e+00            5.932963e-02           0.107823   
min               0.000000e+00            0.000000e+00          -0.000015   
25%               1.000000e+01            1.000000e+00           0.000000   
50%               1.200000e+01            1.000000e+00           0.051605   
75%               1.500000e+01            1.000000e+00           0.108909   
max               2.300000e+01            1.000000e+00           1.000000   

       ...  RATE_INTEREST_PRIVILEGED  DAYS_DECISION  SELLERPLACE_AREA  \
count  ...               5951.000000   1.670214e+06      1.670214e+06   
mean   ...                  0.773503  -8.806797e+02      3.139511e+02   
std    ...                  0.100879   7.790997e+02      7.127443e+03   
min    ...                  0.373150  -2.922000e+03     -1.000000e+00   
25%    ...                  0.715645  -1.300000e+03     -1.000000e+00   
50%    ...                  0.835095  -5.810000e+02      3.000000e+00   
75%    ...                  0.852537  -2.800000e+02      8.200000e+01   
max    ...                  1.000000  -1.000000e+00      4.000000e+06   

        CNT_PAYMENT  DAYS_FIRST_DRAWING  DAYS_FIRST_DUE  \
count  1.297984e+06       997149.000000   997149.000000   
mean   1.605408e+01       342209.855039    13826.269337   
std    1.456729e+01        88916.115833    72444.869708   
min    0.000000e+00        -2922.000000    -2892.000000   
25%    6.000000e+00       365243.000000    -1628.000000   
50%    1.200000e+01       365243.000000     -831.000000   
75%    2.400000e+01       365243.000000     -411.000000   
max    8.400000e+01       365243.000000   365243.000000   

       DAYS_LAST_DUE_1ST_VERSION  DAYS_LAST_DUE  DAYS_TERMINATION  \
count              997149.000000  997149.000000     997149.000000   
mean                33767.774054   76582.403064      81992.343838   
std                106857.034789  149647.415123     153303.516729   
min                 -2801.000000   -2889.000000      -2874.000000   
25%                 -1242.000000   -1314.000000      -1270.000000   
50%                  -361.000000    -537.000000       -499.000000   
75%                   129.000000     -74.000000        -44.000000   
max                365243.000000  365243.000000     365243.000000   

       NFLAG_INSURED_ON_APPROVAL  
count              997149.000000  
mean                    0.332570  
std                     0.471134  
min                     0.000000  
25%                     0.000000  
50%                     0.000000  
75%                     1.000000  
max                     1.000000  

[8 rows x 21 columns]

*************************************************************************************

Feature Selection for application_train:¶

Correlations for Numerical Data¶

In [11]:
#application_train
train = datasets['application_train']
corrs = pd.DataFrame(train.corr()['TARGET']).rename(columns={"TARGET":"cor"})
corrs["abs_corr"] = corrs.abs()
corrs = corrs.sort_values("cor")
print(corrs)
                                  cor  abs_corr
EXT_SOURCE_3                -0.178919  0.178919
EXT_SOURCE_2                -0.160472  0.160472
EXT_SOURCE_1                -0.155317  0.155317
DAYS_EMPLOYED               -0.044932  0.044932
FLOORSMAX_AVG               -0.044003  0.044003
...                               ...       ...
DAYS_LAST_PHONE_CHANGE       0.055218  0.055218
REGION_RATING_CLIENT         0.058899  0.058899
REGION_RATING_CLIENT_W_CITY  0.060893  0.060893
DAYS_BIRTH                   0.078239  0.078239
TARGET                       1.000000  1.000000

[106 rows x 2 columns]
In [12]:
# Top Correlated Features
print("10 Most positive correlations to Target:")
print("-------------------------------------------------------")
print(corrs["cor"].tail(10))
print()

print("\n10 Most negative correlations to Target:")
print("-------------------------------------------------------")
print(corrs["cor"].head(10))
print()

print("\n10 Most correlated to Target (by absolute value):")
print("-------------------------------------------------------")
top_10_corrs = corrs.sort_values("abs_corr", ascending=False).head(11)
print(top_10_corrs)
10 Most positive correlations to Target:
-------------------------------------------------------
FLAG_DOCUMENT_3                0.044346
REG_CITY_NOT_LIVE_CITY         0.044395
FLAG_EMP_PHONE                 0.045982
REG_CITY_NOT_WORK_CITY         0.050994
DAYS_ID_PUBLISH                0.051457
DAYS_LAST_PHONE_CHANGE         0.055218
REGION_RATING_CLIENT           0.058899
REGION_RATING_CLIENT_W_CITY    0.060893
DAYS_BIRTH                     0.078239
TARGET                         1.000000
Name: cor, dtype: float64


10 Most negative correlations to Target:
-------------------------------------------------------
EXT_SOURCE_3                 -0.178919
EXT_SOURCE_2                 -0.160472
EXT_SOURCE_1                 -0.155317
DAYS_EMPLOYED                -0.044932
FLOORSMAX_AVG                -0.044003
FLOORSMAX_MEDI               -0.043768
FLOORSMAX_MODE               -0.043226
AMT_GOODS_PRICE              -0.039645
REGION_POPULATION_RELATIVE   -0.037227
ELEVATORS_AVG                -0.034199
Name: cor, dtype: float64


10 Most correlated to Target (by absolute value):
-------------------------------------------------------
                                  cor  abs_corr
TARGET                       1.000000  1.000000
EXT_SOURCE_3                -0.178919  0.178919
EXT_SOURCE_2                -0.160472  0.160472
EXT_SOURCE_1                -0.155317  0.155317
DAYS_BIRTH                   0.078239  0.078239
REGION_RATING_CLIENT_W_CITY  0.060893  0.060893
REGION_RATING_CLIENT         0.058899  0.058899
DAYS_LAST_PHONE_CHANGE       0.055218  0.055218
DAYS_ID_PUBLISH              0.051457  0.051457
REG_CITY_NOT_WORK_CITY       0.050994  0.050994
FLAG_EMP_PHONE               0.045982  0.045982

Other logical numerical variables to consider:

  • 'AMT_INCOME_TOTAL'
  • 'AMT_CREDIT'

Total income, Amount of credit are all impactful factors on a person's likelihood of paying back a loan.

In [13]:
#Update Numerical Features List to account for Correlation and Logic
selected_num_features = list(top_10_corrs.index)
other_num_features = ['AMT_INCOME_TOTAL','AMT_CREDIT']

for feature in other_num_features: 
    selected_num_features.append(feature)
In [14]:
print("Updated Numerical Features: \n")
for col in selected_num_features:
    print(col)
    
print(f"\n# of Updated Numerical Features Based on High Correlation: {len(selected_num_features)}")
Updated Numerical Features: 

TARGET
EXT_SOURCE_3
EXT_SOURCE_2
EXT_SOURCE_1
DAYS_BIRTH
REGION_RATING_CLIENT_W_CITY
REGION_RATING_CLIENT
DAYS_LAST_PHONE_CHANGE
DAYS_ID_PUBLISH
REG_CITY_NOT_WORK_CITY
FLAG_EMP_PHONE
AMT_INCOME_TOTAL
AMT_CREDIT

# of Updated Numerical Features Based on High Correlation: 13
In [15]:
#Distribution Plots of highest correlated input variables. 
# selected_num_features.remove('TARGET')

cnt_cols = len(selected_num_features)

plt.figure(figsize = (20,40))
for i, var in enumerate (selected_num_features):
    plt.subplot(cnt_cols,5, i+1)
    datasets["application_train"][var].hist()
    
    plt.title (var)
    plt.tight_layout()

plt.show()
In [16]:
#Correlation Heatmap of top most correlated variables with Target
# selected_num_features.insert(0, 'TARGET')
selected_num_features_df = train[selected_num_features]

#Correlation Matrix
selected_num_features_cm = selected_num_features_df.corr()

#Plot Correlation Matrix as a heatmap
mask = np.triu(selected_num_features_cm)

plt.figure(figsize=(20,20))
sns.heatmap(selected_num_features_cm, cmap=plt.cm.coolwarm, annot=True, mask=mask )
plt.title("Correlation Heatmap of Top Correlated Features to Target in application_train")
Out[16]:
Text(0.5, 1.0, 'Correlation Heatmap of Top Correlated Features to Target in application_train')
In [17]:
# Reference: "https://www.geeksforgeeks.org/sort-correlation-matrix-in-python/".

def get_top_abs_correlations(cm):
    # Retain upper triangular values of correlation matrix and 
    # make Lower triangular values Null
    upper_corr_mat = cm.where(np.triu(np.ones(cm.shape),k=1).astype(np.bool))
    
    # Convert to 1-D series and drop Null values 
    unique_corr_pairs = upper_corr_mat.unstack().dropna() 
    
    # Sort correlation pairs 
    sorted_mat = unique_corr_pairs.abs().sort_values() 
    
    return (sorted_mat[sorted_mat > 0.7])


top_abs_corrs = pd.DataFrame(get_top_abs_correlations(selected_num_features_cm))
print("Absolute Correlations > 0.7 Pearson Coefficent:")
top_abs_corrs.columns = ['Correlation Factor']
print(top_abs_corrs)
Absolute Correlations > 0.7 Pearson Coefficent:
                                                  Correlation Factor
REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY            0.950842
In [18]:
top_abs_corrs ['Feature 1 Correlation with Target']=0
top_abs_corrs ['Feature 2 Correlation with Target']=0
i=0

for feature in top_abs_corrs.index:
    top_abs_corrs ['Feature 1 Correlation with Target'].iloc[i] = selected_num_features_cm['TARGET'].loc[feature[0]]
    top_abs_corrs ['Feature 2 Correlation with Target'].iloc[i] = selected_num_features_cm['TARGET'].loc[feature[1]]
    i+=1
In [19]:
top_abs_corrs
Out[19]:
Correlation Factor Feature 1 Correlation with Target Feature 2 Correlation with Target
REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY 0.950842 0.058899 0.060893

Correlation Observations:¶

In the updated selected numerical columns selected from the Application Training dataset, some of the input features were highly correlated with each other. We considered a 'high correlation' when the Pearson Correlation Factor was greater than 0.7, per industry standard.

Feature Selection For these highly correlated input features, we selected one input feature per pair based on the highest correlation factor with the target variable and best judgement if this factor is the same.

Input Features to drop:

  • 'TARGET' --> Output Feature
  • 'FLAG_EMP_PHONE' --> Flag is categorical, Keep Days_employed
  • 'FLOORSMAX_MEDI'
  • 'FLOORSMAX_MODE'
  • 'AMT_CREDIT'
  • 'REGION_RATING_CLIENT'
  • 'REGION_POPULATION_RELATIVE'
  • 'DAYS_EMPLOYED'
  • 'REG_CITY_NOT_LIVE_CITY'
  • 'FLAG_DOCUMENT_THREE' --> Flag is Categorical
In [20]:
inputs_to_drop = ['TARGET','REGION_RATING_CLIENT','FLAG_EMP_PHONE']

for input_var in inputs_to_drop: 
    selected_num_features.remove(input_var)
    
print("Updated Numerical Features based on Correlation Accounting for Multicollinearity:".upper())    
print("-------------------------------------------------------------------------------------")
for col in selected_num_features:
    print(col)
    
print(f"\n# of Variables Listed Above: {len(selected_num_features)}") 
UPDATED NUMERICAL FEATURES BASED ON CORRELATION ACCOUNTING FOR MULTICOLLINEARITY:
-------------------------------------------------------------------------------------
EXT_SOURCE_3
EXT_SOURCE_2
EXT_SOURCE_1
DAYS_BIRTH
REGION_RATING_CLIENT_W_CITY
DAYS_LAST_PHONE_CHANGE
DAYS_ID_PUBLISH
REG_CITY_NOT_WORK_CITY
AMT_INCOME_TOTAL
AMT_CREDIT

# of Variables Listed Above: 10

Categorical Features¶

In [21]:
selected_cat_features = []
for col in train: 
    if train[col].dtype == 'object':
        selected_cat_features.append(col)

#Print Categorical Features
print("Categorical Features:")
print("---------------------")
for col in selected_cat_features: 
    print(col)
    
selected_cat_features_len = len(selected_cat_features)
print(f"\n# of Categorical Features: {selected_cat_features_len}\n")
Categorical Features:
---------------------
NAME_CONTRACT_TYPE
CODE_GENDER
FLAG_OWN_CAR
FLAG_OWN_REALTY
NAME_TYPE_SUITE
NAME_INCOME_TYPE
NAME_EDUCATION_TYPE
NAME_FAMILY_STATUS
NAME_HOUSING_TYPE
OCCUPATION_TYPE
WEEKDAY_APPR_PROCESS_START
ORGANIZATION_TYPE
FONDKAPREMONT_MODE
HOUSETYPE_MODE
WALLSMATERIAL_MODE
EMERGENCYSTATE_MODE

# of Categorical Features: 16

In [22]:
import math
plt.figure(figsize=(10,12)),

fig_rows = math.ceil(selected_cat_features_len/2)
fig, ax = plt.subplots(fig_rows, 2, figsize = (20,50))
col_index = 0
    
for idx, cat in enumerate(selected_cat_features):
    plt.subplot(fig_rows, 2, idx+1)
    sns.countplot(train[cat], hue=train['TARGET'])
    plt.title(f"Distribution of Variable: {cat}")
    plt.xticks(rotation=90)
    plt.tight_layout()
<Figure size 720x864 with 0 Axes>

Based on the the above histograms, the following categorical variables will be dropped:

  • 'NAME_TYPE_SUITE' - Who accompanied client when applying for the previous application doesn't really affect ability to pay
  • 'NAME_HOUSING_TYPE' - Majority of people live in either house or apartment
  • 'WEEKDAY_APPR_PROCESS_START'- Which day of the week a client applied for the loan doesn't really affect ability to pay
  • 'FONDKAPREMONT' - Distribution appears similar across all segments
  • 'HOUSETYPE_MODE' - Almost all applications are from 'block of flats'
  • 'WALLSMATERIAL_MODE' - Almost all wall materials are stone,brick or Panel
  • 'EMERGENCYSTATE_MODE' - Almost all applications were not emergency state mode
  • 'ORGANIZATION_TYPE'- Appears most organization types have relatively balanced proportions of paid and not paid loans. Also, this can be assumed to have a high correlation with occupation type.
  • 'NAME_INCOME_TYPE' - Presumably has high correlation with 'OCCUPATION' and multiple empty categories
In [23]:
inputs_to_drop = ['NAME_TYPE_SUITE', 'NAME_HOUSING_TYPE', 'WEEKDAY_APPR_PROCESS_START', 
                  'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE','ORGANIZATION_TYPE','NAME_INCOME_TYPE']

for input_var in inputs_to_drop: 
    selected_cat_features.remove(input_var)
    
print("Updated Categorical Columns based on Histograms of Distributions".upper())    
print("-------------------------------------------------------------------------------------")
for col in selected_cat_features:
    print(col)
    
print(f"\n# of Variables Listed Above Minus: {len(selected_cat_features)}") 
UPDATED CATEGORICAL COLUMNS BASED ON HISTOGRAMS OF DISTRIBUTIONS
-------------------------------------------------------------------------------------
NAME_CONTRACT_TYPE
CODE_GENDER
FLAG_OWN_CAR
FLAG_OWN_REALTY
NAME_EDUCATION_TYPE
NAME_FAMILY_STATUS
OCCUPATION_TYPE

# of Variables Listed Above Minus: 7

Final Selected Numerical and Categorical Features in Application Training Set¶

In [24]:
print("Final Features Selected from Application Training Set: ")
print()
print('NUMERICAL FEATURES: ')
print('----------------------')
for col in selected_num_features: 
    print(col)
print(f"\n# of Variables Listed Above: {len(selected_num_features)}") 
print()
print()
print('CATEGORICAL FEATURES: ')
print('----------------------')
for col in selected_cat_features: 
    print(col)
print(f"\n# of Variables Listed Above: {len(selected_cat_features)}") 
Final Features Selected from Application Training Set: 

NUMERICAL FEATURES: 
----------------------
EXT_SOURCE_3
EXT_SOURCE_2
EXT_SOURCE_1
DAYS_BIRTH
REGION_RATING_CLIENT_W_CITY
DAYS_LAST_PHONE_CHANGE
DAYS_ID_PUBLISH
REG_CITY_NOT_WORK_CITY
AMT_INCOME_TOTAL
AMT_CREDIT

# of Variables Listed Above: 10


CATEGORICAL FEATURES: 
----------------------
NAME_CONTRACT_TYPE
CODE_GENDER
FLAG_OWN_CAR
FLAG_OWN_REALTY
NAME_EDUCATION_TYPE
NAME_FAMILY_STATUS
OCCUPATION_TYPE

# of Variables Listed Above: 7
In [ ]:
 

Baseline Experiments¶

In [25]:
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from time import time
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import f1_score
In [26]:
#Table to track experimental results
try:
    expLog
except NameError:

    expLog = pd.DataFrame(columns=["Experiment Number",
                                   "Model",
                                   "# Transformed Input Features",
                                   "# Original Numerical Features",
                                   "# Original Categorical Features",
                                   "Train Acc",
                                   "Valid Acc",
                                   "Test Acc",
                                   "Train F1",
                                   "Valid F1",    
                                   "Test F1",
                                   "Train AUROC",
                                   "Valid AUROC",                                   
                                   "Test AUROC",                                 
                                   "Training Time",
                                   "Training Prediction Time",
                                   "Validation Prediction Time",
                                   "Test Prediction Time",
                                   "Hyperparameters",
                                   "Best Parameter",
                                   "Best Hypertuning Score",
                                   "Description"])

display(expLog)
Experiment Number Model # Transformed Input Features # Original Numerical Features # Original Categorical Features Train Acc Valid Acc Test Acc Train F1 Valid F1 ... Valid AUROC Test AUROC Training Time Training Prediction Time Validation Prediction Time Test Prediction Time Hyperparameters Best Parameter Best Hypertuning Score Description

0 rows × 22 columns

In [27]:
# Function to train models
def train_model(df, exp_name, num_features, cat_features, pipeline):
    
    features = num_features + cat_features

    # Split data into Train, Test, and Validation Sets
    y = train['TARGET']
    X = train[features]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
    X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

    print(f"X train           shape: {X_train.shape}")
    print(f"X validation      shape: {X_valid.shape}")
    print(f"X test            shape: {X_test.shape}")
    
    
    print("\nPERFORMING TRAINING: {exp_name}")
    print("\tPipeline:",[name for name, _ in pipeline.steps])
    print("\t# Total Features: ", len(features))
    
    print("\nNumerical Features:")
    print(num_features)
    print("\t# Numerical Features: ", len(num_features))

    print("\nCategorical Features:")
    print(cat_features)
    print("\t# Categorical Features: ", len(cat_features))

    print('\ntraining in progress...')

    #Fit the baseline pipeline to Training data
    start=time()
    model = pipeline.fit(X_train, y_train)
    train_time = np.round(time() - start, 4)
    print(f"\nBaseline Experiment with Original {len(features)} Input Variables - Training Time: %0.3fs" % (train_time))
    
    return features, X_train, X_valid, X_test, y_train, y_valid, y_test, model, train_time
In [28]:
from sklearn.metrics import confusion_matrix

#Function to predict and score trained models
def predict_and_score(X, y, model, model_ID):
    start = time()
    y_pred = model.predict(X)
    pred_time = time() - start
    
    print("\tPrediction Time: %0.3fs" % (pred_time))
    
    acc = accuracy_score(y, y_pred)
    print("\tAccuracy Score: ", acc)
    
    f1 = f1_score(y, y_pred)
    print("\tF1 Score: ", f1)
    
    auroc = roc_auc_score(y, model.predict_proba(X)[:, 1])
    print("\tAUROC Score: ", auroc)
    
    print("\tConfusion Matrix:")
    class_labels = ["0: Repaid","1: Not Repaid"]
    cm = confusion_matrix(y,y_pred).astype(np.float32)
    cm /= cm.sum(axis=1)[:, np.newaxis]
    cm_plot = sns.heatmap(cm, vmin=0, vmax=1, annot=True, cmap="Reds")
    plt.xlabel("Predicted", fontsize=13)
    plt.ylabel("True", fontsize=13)
    cm_plot.set(xticklabels=class_labels, yticklabels=class_labels)
    plt.title(model_ID, fontsize=13)
    plt.show()

    return (cm, y_pred, pred_time, acc, f1, auroc)

Data Preprocessing Pipelines¶

OHE when previously unseen unique values in the test/validation set¶

Train, validation and Test sets (and the leakage problem we have mentioned previously):

Let's look at a small usecase to tell us how to deal with this:

  • The OneHotEncoder is fitted to the training set, which means that for each unique value present in the training set, for each feature, a new column is created. Let's say we have 39 columns after the encoding up from 30 (before preprocessing).
  • The output is a numpy array (when the option sparse=False is used), which has the disadvantage of losing all the information about the original column names and values.
  • When we try to transform the test set, after having fitted the encoder to the training set, we obtain a ValueError. This is because the there are new, previously unseen unique values in the test set and the encoder doesn’t know how to handle these values. In order to use both the transformed training and test sets in machine learning algorithms, we need them to have the same number of columns.

This last problem can be solved by using the option handle_unknown='ignore'of the OneHotEncoder, which, as the name suggests, will ignore previously unseen values when transforming the test set.

Here is a example that in action:

# Identify the categorical features we wish to consider.
cat_attribs = ['CODE_GENDER', 'FLAG_OWN_REALTY','FLAG_OWN_CAR','NAME_CONTRACT_TYPE', 
               'NAME_EDUCATION_TYPE','OCCUPATION_TYPE','NAME_INCOME_TYPE']

# Notice handle_unknown="ignore" in OHE which ignore values from the validation/test that
# do NOT occur in the training set
cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
    ])

Level 3 Pipelines¶

In [29]:
# Pipeline for the numeric features.
# Use StandardScaler() to standardize the data, Missing values will be imputed using the feature mean.
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler(with_mean=False))
])


# Pipeline for the categorical features.
# Entries with missing values or values that don't exist in the range defined above will be one hot encoded as zeroes.
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value= 'missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

Level 2 Pipeline¶

In [30]:
#features_pipeline to combine Numerical and Categorical Pipelines
data_pipeline_17 = ColumnTransformer(
    transformers= [
        ('num', num_pipeline, selected_num_features), 
        ('cat', cat_pipeline, selected_cat_features)],
        remainder='drop',
        n_jobs=-1
    )

# Baseline Experiment
baseline_pipeline_17 = Pipeline([
        ("preparation", data_pipeline_17),
        ("logRegression", LogisticRegression())
    ])

#Name of Experiment
exp_name = "Baseline 1, LogReg with Original 17 Selected Features"

#Description of Experiments
description = 'Baseline 1 LogReg Model with Preselected Num and Cat Features.'

#Start Experiment count for the expLog
exp_count = 1

Baseline Experiment (Level 1 Pipeline) with 17 Selected Features in application_train¶

In [31]:
features, X_train, X_valid, X_test, y_train, y_valid, y_test, model, train_time = train_model(train, exp_name, selected_num_features, selected_cat_features, baseline_pipeline_17)
X train           shape: (209107, 17)
X validation      shape: (52277, 17)
X test            shape: (46127, 17)

PERFORMING TRAINING: {exp_name}
	Pipeline: ['preparation', 'logRegression']
	# Total Features:  17

Numerical Features:
['EXT_SOURCE_3', 'EXT_SOURCE_2', 'EXT_SOURCE_1', 'DAYS_BIRTH', 'REGION_RATING_CLIENT_W_CITY', 'DAYS_LAST_PHONE_CHANGE', 'DAYS_ID_PUBLISH', 'REG_CITY_NOT_WORK_CITY', 'AMT_INCOME_TOTAL', 'AMT_CREDIT']
	# Numerical Features:  10

Categorical Features:
['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'OCCUPATION_TYPE']
	# Categorical Features:  7

training in progress...

Baseline Experiment with Original 17 Input Variables - Training Time: 2.935s

Prediction and Scoring¶

In [32]:
X_train_transformed_17 = data_pipeline_17.fit_transform(X_train)
total_inputs_17 = X_train_transformed_17.shape[1]

# Training Set
print(f"Baseline Experiment with {total_inputs_17} Variables - Training Set:")
cm_train, y_pred_train, pred_time_train, train_acc, train_f1, train_auroc= predict_and_score(X_train, y_train, model, exp_name+' - Training Set')

# Validation Set
print(f"Baseline Experiment with {total_inputs_17} Variables - Validation Set:")
cm_valid, y_pred_valid, pred_time_valid, valid_acc, valid_f1, valid_auroc = predict_and_score(X_valid, y_valid, model, exp_name+' - Validation Set')

# Test Set
print(f"Baseline Experiment with {total_inputs_17} Variables - Test Set:")
cm_test, y_pred_test, pred_time_test, test_acc, test_f1, test_auroc= predict_and_score(X_test, y_test, model, exp_name+' - Test Set')
Baseline Experiment with 49 Variables - Training Set:
	Prediction Time: 0.713s
	Accuracy Score:  0.9198352996312893
	F1 Score:  0.013534984993820987
	AUROC Score:  0.7372722139860943
	Confusion Matrix:
Baseline Experiment with 49 Variables - Validation Set:
	Prediction Time: 0.439s
	Accuracy Score:  0.9164068328327946
	F1 Score:  0.015322217214961693
	AUROC Score:  0.7379010540956517
	Confusion Matrix:
Baseline Experiment with 49 Variables - Test Set:
	Prediction Time: 0.189s
	Accuracy Score:  0.9190062219524356
	F1 Score:  0.01059322033898305
	AUROC Score:  0.7367969785452071
	Confusion Matrix:
In [33]:
expLog.loc[len(expLog)] = [exp_count, 
                           exp_name, 
                           total_inputs_17,
                           len(selected_num_features),
                           len(selected_cat_features),
                           round(train_acc, 3), 
                           round(valid_acc, 3),
                           round(test_acc,3),
                           round(train_f1, 3), 
                           round(valid_f1, 3),
                           round(test_f1,3),
                           round(train_auroc, 3), 
                           round(valid_auroc, 3),                                                      
                           round(test_auroc,3),                         
                           train_time, 
                           pred_time_train,
                           pred_time_valid,
                           pred_time_test, 
                           "N/A",
                           "N/A",
                           "N/A",
                           description]


display(expLog)


exp_count += 1
Experiment Number Model # Transformed Input Features # Original Numerical Features # Original Categorical Features Train Acc Valid Acc Test Acc Train F1 Valid F1 ... Valid AUROC Test AUROC Training Time Training Prediction Time Validation Prediction Time Test Prediction Time Hyperparameters Best Parameter Best Hypertuning Score Description
0 1 Baseline 1, LogReg with Original 17 Selected F... 49 10 7 0.92 0.916 0.919 0.014 0.015 ... 0.738 0.737 2.9348 0.713401 0.438845 0.189358 N/A N/A N/A Baseline 1 LogReg Model with Preselected Num a...

1 rows × 22 columns

Baseline Experiment (Level 1 Pipeline) with all 120 Input Features in application_train¶

In [34]:
# Input Features excluding SK_ID_CURR and TARGET
all_num_features = train.describe().columns.to_list()
all_cat_features = set(train.columns.to_list()) - set(all_num_features)
all_cat_features = list(all_cat_features)

all_num_features.remove('SK_ID_CURR') #ID has no effect on ability to repay loans
all_num_features.remove('TARGET') 
In [35]:
#features_pipeline to combine Numerical and Categorical Pipelines of all features
data_pipeline_120 = ColumnTransformer(
    transformers= [
        ('num', num_pipeline, all_num_features), 
        ('cat', cat_pipeline, all_cat_features)],
        remainder='drop',
        n_jobs=-1
    )

# Baseline Experiment with 120 Input Vars
baseline_pipeline_120 = Pipeline([
        ("preparation", data_pipeline_120),
        ("logRegression", LogisticRegression())
    ])

#Name of Experiment
exp_name = "Baseline 2, LogReg with original 120 Features"

#Description of Experiments
description = 'Baseline 2 LogReg Model with Num and Cat Features.'
In [36]:
features, X_train, X_valid, X_test, y_train, y_valid, y_test, model, train_time= train_model(train, exp_name, all_num_features, all_cat_features, baseline_pipeline_120)
X train           shape: (209107, 120)
X validation      shape: (52277, 120)
X test            shape: (46127, 120)

PERFORMING TRAINING: {exp_name}
	Pipeline: ['preparation', 'logRegression']
	# Total Features:  120

Numerical Features:
['CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR']
	# Numerical Features:  104

Categorical Features:
['FLAG_OWN_REALTY', 'NAME_TYPE_SUITE', 'ORGANIZATION_TYPE', 'HOUSETYPE_MODE', 'NAME_EDUCATION_TYPE', 'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'WEEKDAY_APPR_PROCESS_START', 'OCCUPATION_TYPE', 'NAME_HOUSING_TYPE', 'EMERGENCYSTATE_MODE', 'FONDKAPREMONT_MODE', 'NAME_FAMILY_STATUS', 'NAME_INCOME_TYPE', 'FLAG_OWN_CAR', 'WALLSMATERIAL_MODE']
	# Categorical Features:  16

training in progress...

Baseline Experiment with Original 120 Input Variables - Training Time: 4.948s

Prediction and Scoring¶

In [37]:
X_train_transformed_120 = data_pipeline_120.fit_transform(X_train)
total_inputs_120 = X_train_transformed_120.shape[1]

# Training Set
print(f"Baseline Experiment Training Set with {total_inputs_120} Input Features:")
cm_train, y_pred_train, pred_time_train, train_acc, train_f1, train_auroc = predict_and_score(X_train, y_train, model, exp_name+' - Training Set')

# Validation Set
print(f"Baseline Experiment Validation Set with {total_inputs_120} Input Features:")
cm_valid, y_pred_valid, pred_time_valid, valid_acc, valid_f1, valid_auroc = predict_and_score(X_valid, y_valid, model, exp_name+' - Validation Set')

# Test Set
print(f"Baseline Experiment Test Set with {total_inputs_120} Input Features:")
cm_test, y_pred_test, pred_time_test, test_acc, test_f1, test_auroc = predict_and_score(X_test, y_test, model, exp_name+' - Test Set')
Baseline Experiment Training Set with 250 Input Features:
	Prediction Time: 1.188s
	Accuracy Score:  0.9199548556480653
	F1 Score:  0.021512919443470127
	AUROC Score:  0.745932186550156
	Confusion Matrix:
Baseline Experiment Validation Set with 250 Input Features:
	Prediction Time: 0.360s
	Accuracy Score:  0.9163303173479733
	F1 Score:  0.020161290322580648
	AUROC Score:  0.7463851151424112
	Confusion Matrix:
Baseline Experiment Test Set with 250 Input Features:
	Prediction Time: 0.321s
	Accuracy Score:  0.9193314111041255
	F1 Score:  0.024127983215316024
	AUROC Score:  0.7429677570702033
	Confusion Matrix:
In [38]:
expLog.loc[len(expLog)] = [exp_count, 
                           exp_name, 
                           total_inputs_120,
                           len(all_num_features),
                           len(all_cat_features),
                           round(train_acc, 3), 
                           round(valid_acc, 3),
                           round(test_acc,3),
                           round(train_f1, 3), 
                           round(valid_f1, 3),
                           round(test_f1,3),
                           round(train_auroc, 3), 
                           round(valid_auroc, 3),                                                      
                           round(test_auroc,3),                         
                           train_time, 
                           pred_time_train,
                           pred_time_valid,
                           pred_time_test, 
                           "N/A",
                           "N/A",
                           "N/A",
                           description]

display(expLog)

exp_count += 1
Experiment Number Model # Transformed Input Features # Original Numerical Features # Original Categorical Features Train Acc Valid Acc Test Acc Train F1 Valid F1 ... Valid AUROC Test AUROC Training Time Training Prediction Time Validation Prediction Time Test Prediction Time Hyperparameters Best Parameter Best Hypertuning Score Description
0 1 Baseline 1, LogReg with Original 17 Selected F... 49 10 7 0.92 0.916 0.919 0.014 0.015 ... 0.738 0.737 2.9348 0.713401 0.438845 0.189358 N/A N/A N/A Baseline 1 LogReg Model with Preselected Num a...
1 2 Baseline 2, LogReg with original 120 Features 250 104 16 0.92 0.916 0.919 0.022 0.020 ... 0.746 0.743 4.9481 1.188051 0.359550 0.321185 N/A N/A N/A Baseline 2 LogReg Model with Num and Cat Featu...

2 rows × 22 columns

LogReg Experiment with L1 Penalty with 17 Selected Input Features¶

In [39]:
# LogReg Experiment with L1 Penalty (L2 is default)
L1_pipeline_17 = Pipeline([
        ("preparation", data_pipeline_17),
        ("lassoRegression", LogisticRegression(penalty='l1', solver = 'saga'))
    ])

#Name of Experiment
exp_name = "LogReg - L1 Penalty with Selected 17 Features"

#Description of Experiments
description = 'LogReg Model-L1 Penalty with Selected 17 Cat + Num Features.'

features, X_train, X_valid, X_test, y_train, y_valid, y_test, model, train_time = train_model(train, exp_name, selected_num_features, selected_cat_features, L1_pipeline_17)
X train           shape: (209107, 17)
X validation      shape: (52277, 17)
X test            shape: (46127, 17)

PERFORMING TRAINING: {exp_name}
	Pipeline: ['preparation', 'lassoRegression']
	# Total Features:  17

Numerical Features:
['EXT_SOURCE_3', 'EXT_SOURCE_2', 'EXT_SOURCE_1', 'DAYS_BIRTH', 'REGION_RATING_CLIENT_W_CITY', 'DAYS_LAST_PHONE_CHANGE', 'DAYS_ID_PUBLISH', 'REG_CITY_NOT_WORK_CITY', 'AMT_INCOME_TOTAL', 'AMT_CREDIT']
	# Numerical Features:  10

Categorical Features:
['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'OCCUPATION_TYPE']
	# Categorical Features:  7

training in progress...

Baseline Experiment with Original 17 Input Variables - Training Time: 15.145s

Prediction and Scoring¶

In [40]:
# Training Set
print(f"LogReg - L1 Penalty Training Set with {total_inputs_17} Input Features:")
cm_train, y_pred_train, pred_time_train, train_acc, train_f1, train_auroc = predict_and_score(X_train, y_train, model, exp_name+' - Training Set')

# Validation Set
print(f"LogReg - L1 Penalty Validation Set with {total_inputs_17} Input Features:")
cm_valid, y_pred_valid, pred_time_valid, valid_acc, valid_f1, valid_auroc = predict_and_score(X_valid, y_valid, model, exp_name+' - Validation Set')

# Test Set
print(f"LogReg - L1 Penalty Test Set with {total_inputs_17} Input Features:")
cm_test, y_pred_test, pred_time_test, test_acc, test_f1, test_auroc = predict_and_score(X_test, y_test, model, exp_name+' - Test Set')
LogReg - L1 Penalty Training Set with 49 Input Features:
	Prediction Time: 0.468s
	Accuracy Score:  0.9198400818719603
	F1 Score:  0.0136518771331058
	AUROC Score:  0.7372155400141124
	Confusion Matrix:
LogReg - L1 Penalty Validation Set with 49 Input Features:
	Prediction Time: 0.210s
	Accuracy Score:  0.9164450905752052
	F1 Score:  0.016216216216216217
	AUROC Score:  0.7379875452540834
	Confusion Matrix:
LogReg - L1 Penalty Test Set with 49 Input Features:
	Prediction Time: 0.191s
	Accuracy Score:  0.919027901229215
	F1 Score:  0.011119936457505957
	AUROC Score:  0.7369082524977961
	Confusion Matrix:
In [41]:
expLog.loc[len(expLog)] = [exp_count, 
                           exp_name, 
                           total_inputs_17,
                           len(selected_num_features),
                           len(selected_cat_features),
                           round(train_acc, 3), 
                           round(valid_acc, 3),
                           round(test_acc,3),
                           round(train_f1, 3), 
                           round(valid_f1, 3),
                           round(test_f1,3),
                           round(train_auroc, 3), 
                           round(valid_auroc, 3),                                                      
                           round(test_auroc,3),                         
                           train_time, 
                           pred_time_train,
                           pred_time_valid,
                           pred_time_test, 
                           "N/A",
                           "N/A",
                           "N/A",
                           description]

display(expLog)

exp_count += 1
Experiment Number Model # Transformed Input Features # Original Numerical Features # Original Categorical Features Train Acc Valid Acc Test Acc Train F1 Valid F1 ... Valid AUROC Test AUROC Training Time Training Prediction Time Validation Prediction Time Test Prediction Time Hyperparameters Best Parameter Best Hypertuning Score Description
0 1 Baseline 1, LogReg with Original 17 Selected F... 49 10 7 0.92 0.916 0.919 0.014 0.015 ... 0.738 0.737 2.9348 0.713401 0.438845 0.189358 N/A N/A N/A Baseline 1 LogReg Model with Preselected Num a...
1 2 Baseline 2, LogReg with original 120 Features 250 104 16 0.92 0.916 0.919 0.022 0.020 ... 0.746 0.743 4.9481 1.188051 0.359550 0.321185 N/A N/A N/A Baseline 2 LogReg Model with Num and Cat Featu...
2 3 LogReg - L1 Penalty with Selected 17 Features 49 10 7 0.92 0.916 0.919 0.014 0.016 ... 0.738 0.737 15.1447 0.467919 0.210027 0.190774 N/A N/A N/A LogReg Model-L1 Penalty with Selected 17 Cat +...

3 rows × 22 columns

LogReg Experiment with L1 Penalty - all 120 Input Features¶

In [42]:
# LogReg Experiment with L1 Penalty (L2 is default)
L1_pipeline_120 = Pipeline([
        ("preparation", data_pipeline_120),
        ("lassoRegression", LogisticRegression(penalty='l1', solver = 'saga'))
    ])

#Name of Experiment
exp_name = "LogReg - L1 Penalty with 120 Features"

#Description of Experiments
description = 'LogReg Model-L1 Penalty with 104 Num + 16 Cat Features.'

features, X_train, X_valid, X_test, y_train, y_valid, y_test, model, train_time = train_model(train, exp_name, all_num_features, all_cat_features, L1_pipeline_120)
X train           shape: (209107, 120)
X validation      shape: (52277, 120)
X test            shape: (46127, 120)

PERFORMING TRAINING: {exp_name}
	Pipeline: ['preparation', 'lassoRegression']
	# Total Features:  120

Numerical Features:
['CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR']
	# Numerical Features:  104

Categorical Features:
['FLAG_OWN_REALTY', 'NAME_TYPE_SUITE', 'ORGANIZATION_TYPE', 'HOUSETYPE_MODE', 'NAME_EDUCATION_TYPE', 'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'WEEKDAY_APPR_PROCESS_START', 'OCCUPATION_TYPE', 'NAME_HOUSING_TYPE', 'EMERGENCYSTATE_MODE', 'FONDKAPREMONT_MODE', 'NAME_FAMILY_STATUS', 'NAME_INCOME_TYPE', 'FLAG_OWN_CAR', 'WALLSMATERIAL_MODE']
	# Categorical Features:  16

training in progress...

Baseline Experiment with Original 120 Input Variables - Training Time: 60.468s

Prediction and Scoring¶

In [43]:
# Training Set
print(f"LogReg - L1 Penalty Training Set with {total_inputs_120} Input Features:")
cm_train, y_pred_train, pred_time_train, train_acc, train_f1, train_auroc = predict_and_score(X_train, y_train, model, exp_name+' - Training Set')

# Validation Set
print(f"LogReg - L1 Penalty Validation Set with {total_inputs_120} Input Features:")
cm_valid, y_pred_valid, pred_time_valid, valid_acc, valid_f1, valid_auroc = predict_and_score(X_valid, y_valid, model, exp_name+' - Validation Set')

# Test Set
print(f"LogReg - L1 Penalty Test Set with {total_inputs_120} Input Features:")
cm_test, y_pred_test, pred_time_test, test_acc, test_f1, test_auroc = predict_and_score(X_test, y_test, model, exp_name+' - Test Set')
LogReg - L1 Penalty Training Set with 250 Input Features:
	Prediction Time: 1.212s
	Accuracy Score:  0.9198735575566576
	F1 Score:  0.016898433374405917
	AUROC Score:  0.7440194523416428
	Confusion Matrix:
LogReg - L1 Penalty Validation Set with 250 Input Features:
	Prediction Time: 0.351s
	Accuracy Score:  0.916311188476768
	F1 Score:  0.013528748590755355
	AUROC Score:  0.745327526606072
	Confusion Matrix:
LogReg - L1 Penalty Test Set with 250 Input Features:
	Prediction Time: 0.323s
	Accuracy Score:  0.9192013354434496
	F1 Score:  0.017918313570487485
	AUROC Score:  0.7427405148645342
	Confusion Matrix:
In [44]:
expLog.loc[len(expLog)] = [exp_count, 
                           exp_name, 
                           total_inputs_120,
                           len(all_num_features),
                           len(all_cat_features),
                           round(train_acc, 3), 
                           round(valid_acc, 3),
                           round(test_acc,3),
                           round(train_f1, 3), 
                           round(valid_f1, 3),
                           round(test_f1,3),
                           round(train_auroc, 3), 
                           round(valid_auroc, 3),                                                      
                           round(test_auroc,3),                         
                           train_time, 
                           pred_time_train,
                           pred_time_valid,
                           pred_time_test, 
                           "N/A",
                           "N/A",
                           "N/A",
                           description]

display(expLog)

exp_count += 1
Experiment Number Model # Transformed Input Features # Original Numerical Features # Original Categorical Features Train Acc Valid Acc Test Acc Train F1 Valid F1 ... Valid AUROC Test AUROC Training Time Training Prediction Time Validation Prediction Time Test Prediction Time Hyperparameters Best Parameter Best Hypertuning Score Description
0 1 Baseline 1, LogReg with Original 17 Selected F... 49 10 7 0.92 0.916 0.919 0.014 0.015 ... 0.738 0.737 2.9348 0.713401 0.438845 0.189358 N/A N/A N/A Baseline 1 LogReg Model with Preselected Num a...
1 2 Baseline 2, LogReg with original 120 Features 250 104 16 0.92 0.916 0.919 0.022 0.020 ... 0.746 0.743 4.9481 1.188051 0.359550 0.321185 N/A N/A N/A Baseline 2 LogReg Model with Num and Cat Featu...
2 3 LogReg - L1 Penalty with Selected 17 Features 49 10 7 0.92 0.916 0.919 0.014 0.016 ... 0.738 0.737 15.1447 0.467919 0.210027 0.190774 N/A N/A N/A LogReg Model-L1 Penalty with Selected 17 Cat +...
3 4 LogReg - L1 Penalty with 120 Features 250 104 16 0.92 0.916 0.919 0.017 0.014 ... 0.745 0.743 60.4684 1.212351 0.351223 0.323352 N/A N/A N/A LogReg Model-L1 Penalty with 104 Num + 16 Cat ...

4 rows × 22 columns

Log Reg with all 120 Inputs + New Debt_to_Income_Ratio Feature¶

In [45]:
# All 120 Input Features Plus New Feature Transformation in Pipeline: Debt-to-Income Ratio

from sklearn.base import BaseEstimator, TransformerMixin

class Debt_to_Income_Ratio(BaseEstimator, TransformerMixin):
    def __init__(self, features=None): # no *args or **kargs
        self.features = features
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X):
        df = pd.DataFrame(X.copy(), columns=self.features)  # select a subset of columns in X based on self.features 

        feature1 = 'AMT_CREDIT'
        feature2 = 'AMT_INCOME_TOTAL'
        
        #Create new column for debt-to-income ratio
        df['DEBT_TO_INCOME_RATIO'] = df[feature1] / df[feature2] 
        
        #Drop the features that were initially passed
        df.drop(feature1, axis=1, inplace=True)
        df.drop(feature2, axis=1, inplace=True)
        
        #Return df
        return df
    
    
test_pipeline = make_pipeline(Debt_to_Income_Ratio())
debt_income_ratio = test_pipeline.fit_transform (X_train[['AMT_CREDIT', 'AMT_INCOME_TOTAL']])
display(pd.DataFrame(np.c_[X_train[['AMT_CREDIT', 'AMT_INCOME_TOTAL']],debt_income_ratio], columns=['AMT_CREDIT', 'AMT_INCOME_TOTAL'] + ["DEBT_INCOME_RATIO"]) ) 
AMT_CREDIT AMT_INCOME_TOTAL DEBT_INCOME_RATIO
0 540000.0 144000.0 3.750000
1 1762110.0 225000.0 7.831600
2 161730.0 135000.0 1.198000
3 270000.0 67500.0 4.000000
4 1381113.0 202500.0 6.820311
... ... ... ...
209102 1762110.0 270000.0 6.526333
209103 284400.0 112500.0 2.528000
209104 180000.0 45000.0 4.000000
209105 1736937.0 202500.0 8.577467
209106 157500.0 58500.0 2.692308

209107 rows × 3 columns

In [46]:
data_pipeline_DIR_120 = ColumnTransformer( 
    transformers= [
        # (name, transformer,     columns)
        ('num', num_pipeline, all_num_features),
        ('cat', cat_pipeline, all_cat_features),
        ('DIR', make_pipeline(Debt_to_Income_Ratio(), StandardScaler()), ['AMT_CREDIT', 'AMT_INCOME_TOTAL'])
          
    ],
        remainder='drop',
        n_jobs=-1
    )

X_train_transformed = data_pipeline_DIR_120.fit_transform(X_train)
column_names = all_num_features  + \
               list(data_pipeline_DIR_120.transformers_[1][1].named_steps["onehot"].get_feature_names(all_cat_features)) +\
                ['DEBT_TO_INCOME_RATIO']

display(pd.DataFrame(X_train_transformed,  columns=column_names).head())
number_of_inputs = X_train_transformed.shape[1]
CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH ... FLAG_OWN_CAR_Y WALLSMATERIAL_MODE_Block WALLSMATERIAL_MODE_Mixed WALLSMATERIAL_MODE_Monolithic WALLSMATERIAL_MODE_Others WALLSMATERIAL_MODE_Panel WALLSMATERIAL_MODE_Stone, brick WALLSMATERIAL_MODE_Wooden WALLSMATERIAL_MODE_missing DEBT_TO_INCOME_RATIO
0 2.763729 1.355273 1.339199 2.022258 1.459046 1.496970 -2.541537 -0.020777 -1.388585 -1.504414 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 -0.077059
1 1.381865 2.117614 4.370030 3.208587 4.255552 1.778614 -2.978958 -0.021845 -0.473084 -2.666495 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.437253
2 0.000000 1.270568 0.401090 0.785916 0.364762 0.733344 -5.520724 2.583799 -3.407287 -2.883019 ... 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 -1.023874
3 0.000000 0.635284 0.669600 0.961737 0.729523 2.222725 -4.295578 -0.034366 -2.235083 -1.503752 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.015694
4 0.000000 1.905853 3.425158 2.630799 3.258537 2.222725 -2.070415 -0.010463 -1.092978 -0.992569 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.062055

5 rows × 251 columns

In [47]:
# Baseline Experiment with 120 Input Vars
baseline_pipeline_DIR_120 = Pipeline([
        ("preparation", data_pipeline_DIR_120),
        ("logRegression", LogisticRegression())
    ])

#Name of Experiment)
exp_name = "LogReg with Num and Cat Features + Debt_Income_Ratio"

#Description of Experiments
description =" Logistic Regression Model with Original 120 Num and Cat Features + Debt-Income-Ratio."
In [48]:
features, X_train, X_valid, X_test, y_train, y_valid, y_test, model, train_time = train_model(train, exp_name, all_num_features, all_cat_features, baseline_pipeline_DIR_120)
X train           shape: (209107, 120)
X validation      shape: (52277, 120)
X test            shape: (46127, 120)

PERFORMING TRAINING: {exp_name}
	Pipeline: ['preparation', 'logRegression']
	# Total Features:  120

Numerical Features:
['CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR']
	# Numerical Features:  104

Categorical Features:
['FLAG_OWN_REALTY', 'NAME_TYPE_SUITE', 'ORGANIZATION_TYPE', 'HOUSETYPE_MODE', 'NAME_EDUCATION_TYPE', 'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'WEEKDAY_APPR_PROCESS_START', 'OCCUPATION_TYPE', 'NAME_HOUSING_TYPE', 'EMERGENCYSTATE_MODE', 'FONDKAPREMONT_MODE', 'NAME_FAMILY_STATUS', 'NAME_INCOME_TYPE', 'FLAG_OWN_CAR', 'WALLSMATERIAL_MODE']
	# Categorical Features:  16

training in progress...

Baseline Experiment with Original 120 Input Variables - Training Time: 5.076s

Prediction and Scoring¶

In [49]:
total_inputs = X_train_transformed.shape[1]

# Training Set
print(f"Training Set with all 120 input features + Added Debt-Income-Ratio Feature:")
cm_train, y_pred_train, pred_time_train, train_acc, train_f1, train_auroc = predict_and_score(X_train, y_train, model, exp_name+' - Training Set')

# Validation Set
print(f"Validation Set with all 120 input features + Added Debt-Income-Ratio Feature:")
cm_valid, y_pred_valid, pred_time_valid, valid_acc, valid_f1, valid_auroc = predict_and_score(X_valid, y_valid, model, exp_name+' - Validation Set')

# Test Set
print(f"Test Set with all 120 input features + Added Debt-Income-Ratio Feature:")
cm_test, y_pred_test, pred_time_test, test_acc, test_f1, test_auroc = predict_and_score(X_test, y_test, model, exp_name+' - Test Set')
Training Set with all 120 input features + Added Debt-Income-Ratio Feature:
	Prediction Time: 1.156s
	Accuracy Score:  0.9198352996312893
	F1 Score:  0.020566754309085597
	AUROC Score:  0.7453773395654856
	Confusion Matrix:
Validation Set with all 120 input features + Added Debt-Income-Ratio Feature:
	Prediction Time: 0.350s
	Accuracy Score:  0.9161198997647149
	F1 Score:  0.01791713325867861
	AUROC Score:  0.7456796517828043
	Confusion Matrix:
Test Set with all 120 input features + Added Debt-Income-Ratio Feature:
	Prediction Time: 0.326s
	Accuracy Score:  0.9192013354434496
	F1 Score:  0.02255441909257802
	AUROC Score:  0.7430448006911027
	Confusion Matrix:
In [50]:
expLog.loc[len(expLog)] = [exp_count, 
                           exp_name, 
                           total_inputs,
                           len(all_num_features),
                           len(all_cat_features),
                           round(train_acc, 3), 
                           round(valid_acc, 3),
                           round(test_acc, 3),
                           round(train_f1, 3), 
                           round(valid_f1, 3),
                           round(test_f1, 3),
                           round(train_auroc, 3),                          
                           round(valid_auroc, 3),                                                     
                           round(test_auroc,3),                           
                           train_time, 
                           pred_time_train,
                           pred_time_valid,
                           pred_time_test, 
                           "N/A",
                           "N/A",
                           "N/A",
                           description]

display(expLog)

exp_count += 1
Experiment Number Model # Transformed Input Features # Original Numerical Features # Original Categorical Features Train Acc Valid Acc Test Acc Train F1 Valid F1 ... Valid AUROC Test AUROC Training Time Training Prediction Time Validation Prediction Time Test Prediction Time Hyperparameters Best Parameter Best Hypertuning Score Description
0 1 Baseline 1, LogReg with Original 17 Selected F... 49 10 7 0.92 0.916 0.919 0.014 0.015 ... 0.738 0.737 2.9348 0.713401 0.438845 0.189358 N/A N/A N/A Baseline 1 LogReg Model with Preselected Num a...
1 2 Baseline 2, LogReg with original 120 Features 250 104 16 0.92 0.916 0.919 0.022 0.020 ... 0.746 0.743 4.9481 1.188051 0.359550 0.321185 N/A N/A N/A Baseline 2 LogReg Model with Num and Cat Featu...
2 3 LogReg - L1 Penalty with Selected 17 Features 49 10 7 0.92 0.916 0.919 0.014 0.016 ... 0.738 0.737 15.1447 0.467919 0.210027 0.190774 N/A N/A N/A LogReg Model-L1 Penalty with Selected 17 Cat +...
3 4 LogReg - L1 Penalty with 120 Features 250 104 16 0.92 0.916 0.919 0.017 0.014 ... 0.745 0.743 60.4684 1.212351 0.351223 0.323352 N/A N/A N/A LogReg Model-L1 Penalty with 104 Num + 16 Cat ...
4 5 LogReg with Num and Cat Features + Debt_Income... 251 104 16 0.92 0.916 0.919 0.021 0.018 ... 0.746 0.743 5.0762 1.155572 0.350079 0.325508 N/A N/A N/A Logistic Regression Model with Original 120 N...

5 rows × 22 columns

Log Reg with Selected 17 Inputs + New Debt_to_Income_Ratio Feature¶

In [51]:
data_pipeline_DIR_17 = ColumnTransformer( 
    transformers= [
        # (name, transformer,     columns)
        ('num', num_pipeline, selected_num_features),
        ('cat', cat_pipeline, selected_cat_features),
        ('DIR', make_pipeline(Debt_to_Income_Ratio(), StandardScaler()), ['AMT_CREDIT', 'AMT_INCOME_TOTAL'])
          
    ],
        remainder='drop',
        n_jobs=-1
    )

baseline_pipeline_DIR_17 = Pipeline([
        ("preparation", data_pipeline_DIR_17),
        ("logRegression", LogisticRegression())
    ])


X_train_transformed = data_pipeline_DIR_17.fit_transform(X_train)
total_inputs = X_train_transformed.shape[1]
In [52]:
#Name of Experiment)
exp_name = "LogReg with Num and Cat Features + Debt_Income_Ratio"

#Description of Experiments
description =" Logistic Regression Model with Original 17 Num and Cat Features + Debt-Income-Ratio."

features, X_train, X_valid, X_test, y_train, y_valid, y_test, model, train_time = train_model(train, exp_name, all_num_features, all_cat_features, baseline_pipeline_DIR_17)
X train           shape: (209107, 120)
X validation      shape: (52277, 120)
X test            shape: (46127, 120)

PERFORMING TRAINING: {exp_name}
	Pipeline: ['preparation', 'logRegression']
	# Total Features:  120

Numerical Features:
['CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR']
	# Numerical Features:  104

Categorical Features:
['FLAG_OWN_REALTY', 'NAME_TYPE_SUITE', 'ORGANIZATION_TYPE', 'HOUSETYPE_MODE', 'NAME_EDUCATION_TYPE', 'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'WEEKDAY_APPR_PROCESS_START', 'OCCUPATION_TYPE', 'NAME_HOUSING_TYPE', 'EMERGENCYSTATE_MODE', 'FONDKAPREMONT_MODE', 'NAME_FAMILY_STATUS', 'NAME_INCOME_TYPE', 'FLAG_OWN_CAR', 'WALLSMATERIAL_MODE']
	# Categorical Features:  16

training in progress...

Baseline Experiment with Original 120 Input Variables - Training Time: 2.117s

Prediction and Scoring¶

In [53]:
# Training Set
print(f"Training Set with Selected 22 Input Features + Added Debt-Income-Ratio Feature:")
cm_train, y_pred_train, pred_time_train, train_acc, train_f1, train_auroc = predict_and_score(X_train, y_train, model, exp_name+' - Training Set')

# Validation Set
print(f"Validation Set with Selected 22 Input Features + Added Debt-Income-Ratio Feature:")
cm_valid, y_pred_valid, pred_time_valid, valid_acc, valid_f1, valid_auroc = predict_and_score(X_valid, y_valid, model, exp_name+' - Validation Set')

# Test Set
print(f"Test Set with Selected 22 Input Features + Added Debt-Income-Ratio Feature:")
cm_test, y_pred_test, pred_time_test, test_acc, test_f1, test_auroc = predict_and_score(X_test, y_test, model, exp_name+' - Test Set')
Training Set with Selected 22 Input Features + Added Debt-Income-Ratio Feature:
	Prediction Time: 0.463s
	Accuracy Score:  0.9198592108346445
	F1 Score:  0.014235294117647058
	AUROC Score:  0.7377964403845755
	Confusion Matrix:
Validation Set with Selected 22 Input Features + Added Debt-Income-Ratio Feature:
	Prediction Time: 0.200s
	Accuracy Score:  0.9164450905752052
	F1 Score:  0.016216216216216217
	AUROC Score:  0.7379073053753373
	Confusion Matrix:
Test Set with Selected 22 Input Features + Added Debt-Income-Ratio Feature:
	Prediction Time: 0.184s
	Accuracy Score:  0.9190712597827736
	F1 Score:  0.011649457241196717
	AUROC Score:  0.7373531897169191
	Confusion Matrix:
In [54]:
expLog.loc[len(expLog)] = [exp_count, 
                           exp_name, 
                           total_inputs,
                           len(all_num_features),
                           len(all_cat_features),
                           round(train_acc, 3), 
                           round(valid_acc, 3),
                           round(test_acc, 3),
                           round(train_f1, 3), 
                           round(valid_f1, 3),
                           round(test_f1, 3),
                           round(train_auroc, 3),                          
                           round(valid_auroc, 3),                                                     
                           round(test_auroc,3),                           
                           train_time, 
                           pred_time_train,
                           pred_time_valid,
                           pred_time_test, 
                           "N/A",
                           "N/A",
                           "N/A",
                           description]

display(expLog)

exp_count += 1
Experiment Number Model # Transformed Input Features # Original Numerical Features # Original Categorical Features Train Acc Valid Acc Test Acc Train F1 Valid F1 ... Valid AUROC Test AUROC Training Time Training Prediction Time Validation Prediction Time Test Prediction Time Hyperparameters Best Parameter Best Hypertuning Score Description
0 1 Baseline 1, LogReg with Original 17 Selected F... 49 10 7 0.92 0.916 0.919 0.014 0.015 ... 0.738 0.737 2.9348 0.713401 0.438845 0.189358 N/A N/A N/A Baseline 1 LogReg Model with Preselected Num a...
1 2 Baseline 2, LogReg with original 120 Features 250 104 16 0.92 0.916 0.919 0.022 0.020 ... 0.746 0.743 4.9481 1.188051 0.359550 0.321185 N/A N/A N/A Baseline 2 LogReg Model with Num and Cat Featu...
2 3 LogReg - L1 Penalty with Selected 17 Features 49 10 7 0.92 0.916 0.919 0.014 0.016 ... 0.738 0.737 15.1447 0.467919 0.210027 0.190774 N/A N/A N/A LogReg Model-L1 Penalty with Selected 17 Cat +...
3 4 LogReg - L1 Penalty with 120 Features 250 104 16 0.92 0.916 0.919 0.017 0.014 ... 0.745 0.743 60.4684 1.212351 0.351223 0.323352 N/A N/A N/A LogReg Model-L1 Penalty with 104 Num + 16 Cat ...
4 5 LogReg with Num and Cat Features + Debt_Income... 251 104 16 0.92 0.916 0.919 0.021 0.018 ... 0.746 0.743 5.0762 1.155572 0.350079 0.325508 N/A N/A N/A Logistic Regression Model with Original 120 N...
5 6 LogReg with Num and Cat Features + Debt_Income... 50 104 16 0.92 0.916 0.919 0.014 0.016 ... 0.738 0.737 2.1173 0.463066 0.200407 0.184219 N/A N/A N/A Logistic Regression Model with Original 17 Nu...

6 rows × 22 columns

Other Experiments Using 17 Selected Input Features:¶

In [55]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
In [56]:
clf_names = ["Random Forest",
#              "SVC"
]

clfs = [RandomForestClassifier(n_jobs=-1, class_weight='balanced'),
#       SVC()
]

for clf_name, clf in zip(clf_names, clfs): 
    
    print("-----------------------------------------------------")
    print(f"{clf_name.upper()}")
    print("-----------------------------------------------------")
    pipe = Pipeline([
        ("preparation", data_pipeline_17),
        ("clf", clf),
    ])
    
    # Name of Experiment
    exp_name = clf_name +" with 17 Features"
    
    # Description of Experiment
    description = f'{clf_name} Model with 10 Num + 7 Cat Features.'
    
    features, X_train, X_valid, X_test, y_train, y_valid, y_test, model, train_time= train_model(train, exp_name, selected_num_features, selected_cat_features, pipe)
    
    
    # Training Set
    print("Baseline Experiment with 17 Variables - Training Set:")
    cm_train, y_pred_train, pred_time_train, train_acc, train_f1, train_auroc = predict_and_score(X_train, y_train, model, exp_name+' - Training Set')

    # Validation Set
    print("Baseline Experiment with 17 Variables - Validation Set:")
    cm_valid, y_pred_valid, pred_time_valid, valid_acc, valid_f1, valid_auroc = predict_and_score(X_valid, y_valid, model, exp_name+' - Validation Set')

    # Test Set
    print("Baseline Experiment with 17 Variables - Test Set:")
    cm_test, y_pred_test, pred_time_test, test_acc, test_f1, test_auroc = predict_and_score(X_test, y_test, model, exp_name+' - Test Set')
    
    expLog.loc[len(expLog)] = [exp_count, 
                           exp_name, 
                           total_inputs_17,
                           len(selected_num_features),
                           len(selected_cat_features),
                           round(train_acc, 3), 
                           round(valid_acc, 3),
                           round(test_acc,3),
                           round(train_f1, 3), 
                           round(valid_f1, 3),
                           round(test_f1,3),
                           round(train_auroc, 3), 
                           round(valid_auroc, 3),                                                      
                           round(test_auroc,3),                               
                           train_time, 
                           pred_time_train,
                           pred_time_valid,
                           pred_time_test, 
                           "N/A",
                           "N/A",
                           "N/A",
                           description]
    exp_count += 1
    
-----------------------------------------------------
RANDOM FOREST
-----------------------------------------------------
X train           shape: (209107, 17)
X validation      shape: (52277, 17)
X test            shape: (46127, 17)

PERFORMING TRAINING: {exp_name}
	Pipeline: ['preparation', 'clf']
	# Total Features:  17

Numerical Features:
['EXT_SOURCE_3', 'EXT_SOURCE_2', 'EXT_SOURCE_1', 'DAYS_BIRTH', 'REGION_RATING_CLIENT_W_CITY', 'DAYS_LAST_PHONE_CHANGE', 'DAYS_ID_PUBLISH', 'REG_CITY_NOT_WORK_CITY', 'AMT_INCOME_TOTAL', 'AMT_CREDIT']
	# Numerical Features:  10

Categorical Features:
['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'OCCUPATION_TYPE']
	# Categorical Features:  7

training in progress...

Baseline Experiment with Original 17 Input Variables - Training Time: 4.923s
Baseline Experiment with 17 Variables - Training Set:
	Prediction Time: 0.840s
	Accuracy Score:  0.9999713065559738
	F1 Score:  0.9998207242739333
	AUROC Score:  1.0
	Confusion Matrix:
Baseline Experiment with 17 Variables - Validation Set:
	Prediction Time: 0.325s
	Accuracy Score:  0.9163685750903839
	F1 Score:  0.004100227790432802
	AUROC Score:  0.7195105509709813
	Confusion Matrix:
Baseline Experiment with 17 Variables - Test Set:
	Prediction Time: 0.295s
	Accuracy Score:  0.9195265245951395
	F1 Score:  0.006955591225254147
	AUROC Score:  0.7205793058613987
	Confusion Matrix:
In [57]:
display(expLog)
Experiment Number Model # Transformed Input Features # Original Numerical Features # Original Categorical Features Train Acc Valid Acc Test Acc Train F1 Valid F1 ... Valid AUROC Test AUROC Training Time Training Prediction Time Validation Prediction Time Test Prediction Time Hyperparameters Best Parameter Best Hypertuning Score Description
0 1 Baseline 1, LogReg with Original 17 Selected F... 49 10 7 0.92 0.916 0.919 0.014 0.015 ... 0.738 0.737 2.9348 0.713401 0.438845 0.189358 N/A N/A N/A Baseline 1 LogReg Model with Preselected Num a...
1 2 Baseline 2, LogReg with original 120 Features 250 104 16 0.92 0.916 0.919 0.022 0.020 ... 0.746 0.743 4.9481 1.188051 0.359550 0.321185 N/A N/A N/A Baseline 2 LogReg Model with Num and Cat Featu...
2 3 LogReg - L1 Penalty with Selected 17 Features 49 10 7 0.92 0.916 0.919 0.014 0.016 ... 0.738 0.737 15.1447 0.467919 0.210027 0.190774 N/A N/A N/A LogReg Model-L1 Penalty with Selected 17 Cat +...
3 4 LogReg - L1 Penalty with 120 Features 250 104 16 0.92 0.916 0.919 0.017 0.014 ... 0.745 0.743 60.4684 1.212351 0.351223 0.323352 N/A N/A N/A LogReg Model-L1 Penalty with 104 Num + 16 Cat ...
4 5 LogReg with Num and Cat Features + Debt_Income... 251 104 16 0.92 0.916 0.919 0.021 0.018 ... 0.746 0.743 5.0762 1.155572 0.350079 0.325508 N/A N/A N/A Logistic Regression Model with Original 120 N...
5 6 LogReg with Num and Cat Features + Debt_Income... 50 104 16 0.92 0.916 0.919 0.014 0.016 ... 0.738 0.737 2.1173 0.463066 0.200407 0.184219 N/A N/A N/A Logistic Regression Model with Original 17 Nu...
6 7 Random Forest with 17 Features 49 10 7 1.00 0.916 0.920 1.000 0.004 ... 0.720 0.721 4.9230 0.839898 0.325235 0.294644 N/A N/A N/A Random Forest Model with 10 Num + 7 Cat Features.

7 rows × 22 columns

Gradboost experiment¶

In [58]:
from sklearn.experimental import enable_hist_gradient_boosting  
from sklearn.ensemble import HistGradientBoostingClassifier
In [59]:
from sklearn.base import TransformerMixin
class DenseTransformer(TransformerMixin):

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None,):
        return X.todense()
In [60]:
clf_names = ["Gradboost",
#              "SVC"
]

clfs = [HistGradientBoostingClassifier()
#       SVC()
]

for clf_name, clf in zip(clf_names, clfs): 
    
    print("-----------------------------------------------------")
    print(f"{clf_name.upper()}")
    print("-----------------------------------------------------")
    pipe = Pipeline([
        ("preparation", data_pipeline_17),
        #("to_dense", DenseTransformer()),
        ("clf", clf)
    ])
    
    # Name of Experiment
    exp_name = clf_name +" with 17 Features"
    
    # Description of Experiment
    description = f'{clf_name} Model with 10 Num + 7 Cat Features.'
    
    features, X_train, X_valid, X_test, y_train, y_valid, y_test, model, train_time = train_model(train, exp_name, selected_num_features, selected_cat_features, pipe)
    
    
    # Training Set
    print("Baseline Experiment with 17 Variables - Training Set:")
    cm_train, y_pred_train, pred_time_train, train_acc, train_f1, train_auroc = predict_and_score(X_train, y_train, model, exp_name+' - Training Set')

    # Validation Set
    print("Baseline Experiment with 17 Variables - Validation Set:")
    cm_valid, y_pred_valid, pred_time_valid, valid_acc, valid_f1, valid_auroc = predict_and_score(X_valid, y_valid, model, exp_name+' - Validation Set')

    # Test Set
    print("Baseline Experiment with 17 Variables - Test Set:")
    cm_test, y_pred_test, pred_time_test, test_acc, test_f1, test_auroc = predict_and_score(X_test, y_test, model, exp_name+' - Test Set')
    
    expLog.loc[len(expLog)] = [exp_count, 
                           exp_name, 
                           len(features),
                           len(selected_num_features),
                           len(selected_cat_features),
                           round(train_acc, 3), 
                           round(valid_acc, 3),
                           round(test_acc,3),
                           round(train_f1, 3), 
                           round(valid_f1, 3),
                           round(test_f1,3),
                           round(train_auroc, 3), 
                           round(valid_auroc, 3),                                                      
                           round(test_auroc,3),                               
                           train_time, 
                           pred_time_train,
                           pred_time_valid,
                           pred_time_test, 
                           "N/A",
                           "N/A",
                           "N/A",
                           description]
    exp_count += 1
    

    
display(expLog)
-----------------------------------------------------
GRADBOOST
-----------------------------------------------------
X train           shape: (209107, 17)
X validation      shape: (52277, 17)
X test            shape: (46127, 17)

PERFORMING TRAINING: {exp_name}
	Pipeline: ['preparation', 'clf']
	# Total Features:  17

Numerical Features:
['EXT_SOURCE_3', 'EXT_SOURCE_2', 'EXT_SOURCE_1', 'DAYS_BIRTH', 'REGION_RATING_CLIENT_W_CITY', 'DAYS_LAST_PHONE_CHANGE', 'DAYS_ID_PUBLISH', 'REG_CITY_NOT_WORK_CITY', 'AMT_INCOME_TOTAL', 'AMT_CREDIT']
	# Numerical Features:  10

Categorical Features:
['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'OCCUPATION_TYPE']
	# Categorical Features:  7

training in progress...

Baseline Experiment with Original 17 Input Variables - Training Time: 1.821s
Baseline Experiment with 17 Variables - Training Set:
	Prediction Time: 0.591s
	Accuracy Score:  0.9207056674334193
	F1 Score:  0.030974227105370813
	AUROC Score:  0.7740722497994058
	Confusion Matrix:
Baseline Experiment with 17 Variables - Validation Set:
	Prediction Time: 0.261s
	Accuracy Score:  0.9166555081584635
	F1 Score:  0.021997755331088664
	AUROC Score:  0.7464147311171788
	Confusion Matrix:
Baseline Experiment with 17 Variables - Test Set:
	Prediction Time: 0.235s
	Accuracy Score:  0.9197433173629328
	F1 Score:  0.024248813916710594
	AUROC Score:  0.7458278199091248
	Confusion Matrix:
Experiment Number Model # Transformed Input Features # Original Numerical Features # Original Categorical Features Train Acc Valid Acc Test Acc Train F1 Valid F1 ... Valid AUROC Test AUROC Training Time Training Prediction Time Validation Prediction Time Test Prediction Time Hyperparameters Best Parameter Best Hypertuning Score Description
0 1 Baseline 1, LogReg with Original 17 Selected F... 49 10 7 0.920 0.916 0.919 0.014 0.015 ... 0.738 0.737 2.9348 0.713401 0.438845 0.189358 N/A N/A N/A Baseline 1 LogReg Model with Preselected Num a...
1 2 Baseline 2, LogReg with original 120 Features 250 104 16 0.920 0.916 0.919 0.022 0.020 ... 0.746 0.743 4.9481 1.188051 0.359550 0.321185 N/A N/A N/A Baseline 2 LogReg Model with Num and Cat Featu...
2 3 LogReg - L1 Penalty with Selected 17 Features 49 10 7 0.920 0.916 0.919 0.014 0.016 ... 0.738 0.737 15.1447 0.467919 0.210027 0.190774 N/A N/A N/A LogReg Model-L1 Penalty with Selected 17 Cat +...
3 4 LogReg - L1 Penalty with 120 Features 250 104 16 0.920 0.916 0.919 0.017 0.014 ... 0.745 0.743 60.4684 1.212351 0.351223 0.323352 N/A N/A N/A LogReg Model-L1 Penalty with 104 Num + 16 Cat ...
4 5 LogReg with Num and Cat Features + Debt_Income... 251 104 16 0.920 0.916 0.919 0.021 0.018 ... 0.746 0.743 5.0762 1.155572 0.350079 0.325508 N/A N/A N/A Logistic Regression Model with Original 120 N...
5 6 LogReg with Num and Cat Features + Debt_Income... 50 104 16 0.920 0.916 0.919 0.014 0.016 ... 0.738 0.737 2.1173 0.463066 0.200407 0.184219 N/A N/A N/A Logistic Regression Model with Original 17 Nu...
6 7 Random Forest with 17 Features 49 10 7 1.000 0.916 0.920 1.000 0.004 ... 0.720 0.721 4.9230 0.839898 0.325235 0.294644 N/A N/A N/A Random Forest Model with 10 Num + 7 Cat Features.
7 8 Gradboost with 17 Features 17 10 7 0.921 0.917 0.920 0.031 0.022 ... 0.746 0.746 1.8211 0.591207 0.261072 0.234816 N/A N/A N/A Gradboost Model with 10 Num + 7 Cat Features.

8 rows × 22 columns

In [61]:
clf_names = ["Gradboost",
#              "SVC"
]

clfs = [HistGradientBoostingClassifier()
#       SVC()
]

for clf_name, clf in zip(clf_names, clfs): 
    
    print("-----------------------------------------------------")
    print(f"{clf_name.upper()}")
    print("-----------------------------------------------------")
    pipe = Pipeline([
        ("preparation", data_pipeline_120),
        ("to_dense", DenseTransformer()),
        ("clf", clf)
    ])
    
    # Name of Experiment
    exp_name = clf_name +" with 120 Features"
    
    # Description of Experiment
    description = f'{clf_name} Model with 104 Num + 16 Cat Features.'
    
    features, X_train, X_valid, X_test, y_train, y_valid, y_test, model, train_time = train_model(train, exp_name, all_num_features, all_cat_features, baseline_pipeline_120)
    
    
    
    # Training Set
    print("Baseline Experiment with 120 Variables - Training Set:")
    cm_train, y_pred_train, pred_time_train, train_acc, train_f1, train_auroc = predict_and_score(X_train, y_train, model, exp_name+' - Training Set')

    # Validation Set
    print("Baseline Experiment with 120 Variables - Validation Set:")
    cm_valid, y_pred_valid, pred_time_valid, valid_acc, valid_f1, valid_auroc = predict_and_score(X_valid, y_valid, model, exp_name+' - Validation Set')

    # Test Set
    print("Baseline Experiment with 120 Variables - Test Set:")
    cm_test, y_pred_test, pred_time_test, test_acc, test_f1, test_auroc = predict_and_score(X_test, y_test, model, exp_name+' - Test Set')
    
    expLog.loc[len(expLog)] = [exp_count, 
                           exp_name, 
                           len(features),
                           len(selected_num_features),
                           len(selected_cat_features),
                           round(train_acc, 3), 
                           round(valid_acc, 3),
                           round(test_acc,3),
                           round(train_f1, 3), 
                           round(valid_f1, 3),
                           round(test_f1,3),
                           round(train_auroc, 3), 
                           round(valid_auroc, 3),                                                      
                           round(test_auroc,3),                               
                           train_time, 
                           pred_time_train,
                           pred_time_valid,
                           pred_time_test, 
                           "N/A",
                           "N/A",
                           "N/A",
                           description]
    exp_count += 1
    

    
display(expLog)
-----------------------------------------------------
GRADBOOST
-----------------------------------------------------
X train           shape: (209107, 120)
X validation      shape: (52277, 120)
X test            shape: (46127, 120)

PERFORMING TRAINING: {exp_name}
	Pipeline: ['preparation', 'logRegression']
	# Total Features:  120

Numerical Features:
['CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR']
	# Numerical Features:  104

Categorical Features:
['FLAG_OWN_REALTY', 'NAME_TYPE_SUITE', 'ORGANIZATION_TYPE', 'HOUSETYPE_MODE', 'NAME_EDUCATION_TYPE', 'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'WEEKDAY_APPR_PROCESS_START', 'OCCUPATION_TYPE', 'NAME_HOUSING_TYPE', 'EMERGENCYSTATE_MODE', 'FONDKAPREMONT_MODE', 'NAME_FAMILY_STATUS', 'NAME_INCOME_TYPE', 'FLAG_OWN_CAR', 'WALLSMATERIAL_MODE']
	# Categorical Features:  16

training in progress...

Baseline Experiment with Original 120 Input Variables - Training Time: 5.113s
Baseline Experiment with 120 Variables - Training Set:
	Prediction Time: 1.147s
	Accuracy Score:  0.9199548556480653
	F1 Score:  0.021512919443470127
	AUROC Score:  0.745932186550156
	Confusion Matrix:
Baseline Experiment with 120 Variables - Validation Set:
	Prediction Time: 0.364s
	Accuracy Score:  0.9163303173479733
	F1 Score:  0.020161290322580648
	AUROC Score:  0.7463851151424112
	Confusion Matrix:
Baseline Experiment with 120 Variables - Test Set:
	Prediction Time: 0.324s
	Accuracy Score:  0.9193314111041255
	F1 Score:  0.024127983215316024
	AUROC Score:  0.7429677570702033
	Confusion Matrix:
Experiment Number Model # Transformed Input Features # Original Numerical Features # Original Categorical Features Train Acc Valid Acc Test Acc Train F1 Valid F1 ... Valid AUROC Test AUROC Training Time Training Prediction Time Validation Prediction Time Test Prediction Time Hyperparameters Best Parameter Best Hypertuning Score Description
0 1 Baseline 1, LogReg with Original 17 Selected F... 49 10 7 0.920 0.916 0.919 0.014 0.015 ... 0.738 0.737 2.9348 0.713401 0.438845 0.189358 N/A N/A N/A Baseline 1 LogReg Model with Preselected Num a...
1 2 Baseline 2, LogReg with original 120 Features 250 104 16 0.920 0.916 0.919 0.022 0.020 ... 0.746 0.743 4.9481 1.188051 0.359550 0.321185 N/A N/A N/A Baseline 2 LogReg Model with Num and Cat Featu...
2 3 LogReg - L1 Penalty with Selected 17 Features 49 10 7 0.920 0.916 0.919 0.014 0.016 ... 0.738 0.737 15.1447 0.467919 0.210027 0.190774 N/A N/A N/A LogReg Model-L1 Penalty with Selected 17 Cat +...
3 4 LogReg - L1 Penalty with 120 Features 250 104 16 0.920 0.916 0.919 0.017 0.014 ... 0.745 0.743 60.4684 1.212351 0.351223 0.323352 N/A N/A N/A LogReg Model-L1 Penalty with 104 Num + 16 Cat ...
4 5 LogReg with Num and Cat Features + Debt_Income... 251 104 16 0.920 0.916 0.919 0.021 0.018 ... 0.746 0.743 5.0762 1.155572 0.350079 0.325508 N/A N/A N/A Logistic Regression Model with Original 120 N...
5 6 LogReg with Num and Cat Features + Debt_Income... 50 104 16 0.920 0.916 0.919 0.014 0.016 ... 0.738 0.737 2.1173 0.463066 0.200407 0.184219 N/A N/A N/A Logistic Regression Model with Original 17 Nu...
6 7 Random Forest with 17 Features 49 10 7 1.000 0.916 0.920 1.000 0.004 ... 0.720 0.721 4.9230 0.839898 0.325235 0.294644 N/A N/A N/A Random Forest Model with 10 Num + 7 Cat Features.
7 8 Gradboost with 17 Features 17 10 7 0.921 0.917 0.920 0.031 0.022 ... 0.746 0.746 1.8211 0.591207 0.261072 0.234816 N/A N/A N/A Gradboost Model with 10 Num + 7 Cat Features.
8 9 Gradboost with 120 Features 120 10 7 0.920 0.916 0.919 0.022 0.020 ... 0.746 0.743 5.1127 1.147315 0.364037 0.323979 N/A N/A N/A Gradboost Model with 104 Num + 16 Cat Features.

9 rows × 22 columns

Hyperparameter Tuning¶

In [62]:
clf_best_parameters = {}

# Function to run GridSearchCV and log experiments
def gs_classifier(in_features, clf_name, clf, parameters, expCount):
    y = train['TARGET']
    X = train[in_features]
    total_selected_inputs = len(in_features)
    
    numerical_features = X.describe().columns.to_list()
    total_num_inputs = len(numerical_features)
    
    categorical_features = set(X.columns.to_list()) - set(numerical_features)
    categorical_features = list(categorical_features)
    total_cat_inputs = len(categorical_features)
    
       
    description = f'{clf_name} with {total_selected_inputs}'
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
    X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
    
    print(f"X train           shape: {X_train.shape}")
    print(f"X validation      shape: {X_valid.shape}")
    print(f"X test            shape: {X_test.shape}")
    
    
    data_pipeline = ColumnTransformer(transformers=[
            ("num_pipeline", num_pipeline, numerical_features),
            ("cat_pipeline", cat_pipeline, categorical_features)],
            remainder='drop',
            n_jobs=-1
        )
    
    clf_pipeline = Pipeline([
            ("preparation", data_pipeline),# combination of numerical, categorical subpipelines
            ("clf", clf)  # classifier estimator you are using
        ])

    gs = GridSearchCV(clf_pipeline,
                      parameters,
                      scoring=['f1','roc_auc'],
                      cv=3,
                      refit='roc_auc',
                      n_jobs=-1,
                      verbose=1)

    print("\nPERFORMING GRID SEARCH FOR {}...".format(clf_name.upper()))
    print("\tpipeline:",[name for name, _ in clf_pipeline.steps])
    print("\tparameters:", parameters)
    print()

    start = time()
    gs.fit(X_train, y_train)

    train_time = time() - start
    print("\tTraining Time: %0.3fs" % (time() - start))
    print()
    
    # Training Set
    print(f"{clf_name} Training Set with {total_selected_inputs} Input Features:")
    cm_train, y_pred_train, pred_time_train, train_acc, train_f1, train_auroc = predict_and_score(X_train, y_train, model, exp_name+' - Training Set')

    # Validation Set
    print(f"{clf_name} Validation Set with {total_selected_inputs} Input Features:")
    cm_valid, y_pred_valid, pred_time_valid, valid_acc, valid_f1, valid_auroc = predict_and_score(X_valid, y_valid, model, exp_name+' - Validation Set')

    # Test Set
    print(f"{clf_name} Experiment Test Set with {total_selected_inputs} Input Features:")
    cm_test, y_pred_test, pred_time_test, test_acc, test_f1, test_auroc = predict_and_score(X_test, y_test, model, exp_name+' - Test Set')


    print("\n\tBest score: %0.3f" % gs.best_score_)
    print("\tBest parameters set:")
    best_parameters = gs.best_estimator_.get_params()

    best_parameters_dict = {}
    for param_name in sorted(parameters.keys()):
        print("\t\t%s: %r" % (param_name, best_parameters[param_name]))
        best_parameters_dict[param_name] = best_parameters[param_name]
        clf_best_parameters[clf_name] = best_parameters_dict    
    print()
    print()

    expLog.loc[len(expLog)] = [exp_count, 
                           clf_name, 
                           total_selected_inputs,
                           total_num_inputs,
                           total_cat_inputs,
                           round(train_acc, 3), 
                           round(valid_acc, 3),
                           round(test_acc, 3),
                           round(train_f1, 3), 
                           round(valid_f1, 3),
                           round(test_f1, 3),
                           round(train_auroc, 3),                          
                           round(valid_auroc, 3),                                                     
                           round(test_auroc,3), 
                           train_time, 
                           pred_time_train,
                           pred_time_valid,
                           pred_time_test, 
                           parameters,
                           best_parameters_dict,
                           round(gs.best_score_,3),
                           description]
    
    exp_count += 1
In [63]:
# Grid Search over Preparation Pipeline and Classifiers

clf_names = ["Random Forest",
#              "Logistic Regression", 
#              "SVC",
]

estimators = [RandomForestClassifier(),
#               LogisticRegression(solver='saga'),
#              SVC(),
]

param_grids = [{'clf__n_estimators':[300,500],
                'clf__max_features':['sqrt','log2',None],
                },
                       
#                 'clf__C': [1.0, 10.0, 100.0, 1000.0, 10000.0],
#                 'clf__penalty':[None, 'l1','l2']},
#               {'clf__C': [0.001, 0.01, 0.1, 1.], 
#                'clf__kernel': ["linear", "poly", "rbf", "sigmoid"],
#                'clf__gamma':["scale", "auto"]}
]


selected_features = selected_num_features + selected_cat_features
expCount = 1
for clf_name, clf, parameters in zip(clf_names, estimators, param_grids): 
    gs_classifier(selected_features, clf_name, clf, parameters, expCount)
    expCount += 1
X train           shape: (209107, 17)
X validation      shape: (52277, 17)
X test            shape: (46127, 17)

PERFORMING GRID SEARCH FOR RANDOM FOREST...
	pipeline: ['preparation', 'clf']
	parameters: {'clf__n_estimators': [300, 500], 'clf__max_features': ['sqrt', 'log2', None]}

Fitting 3 folds for each of 6 candidates, totalling 18 fits
	Training Time: 938.274s

Random Forest Training Set with 17 Input Features:
---------------------------------------------------------------------------
Empty                                     Traceback (most recent call last)
/usr/local/lib/python3.9/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
    821             try:
--> 822                 tasks = self._ready_batches.get(block=False)
    823             except queue.Empty:

/usr/local/lib/python3.9/queue.py in get(self, block, timeout)
    167                 if not self._qsize():
--> 168                     raise Empty
    169             elif timeout is None:

Empty: 

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-63-510d7f5677fb> in <module>
     26 expCount = 1
     27 for clf_name, clf, parameters in zip(clf_names, estimators, param_grids):
---> 28     gs_classifier(selected_features, clf_name, clf, parameters, expCount)
     29     expCount += 1

<ipython-input-62-884142a3426c> in gs_classifier(in_features, clf_name, clf, parameters, expCount)
     59     # Training Set
     60     print(f"{clf_name} Training Set with {total_selected_inputs} Input Features:")
---> 61     cm_train, y_pred_train, pred_time_train, train_acc, train_f1, train_auroc = predict_and_score(X_train, y_train, model, exp_name+' - Training Set')
     62 
     63     # Validation Set

<ipython-input-28-1f0a222ce133> in predict_and_score(X, y, model, model_ID)
      4 def predict_and_score(X, y, model, model_ID):
      5     start = time()
----> 6     y_pred = model.predict(X)
      7     pred_time = time() - start
      8 

/usr/local/lib/python3.9/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
    111 
    112             # lambda, but not partial, allows help() to work with update_wrapper
--> 113             out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)  # noqa
    114         else:
    115 

/usr/local/lib/python3.9/site-packages/sklearn/pipeline.py in predict(self, X, **predict_params)
    467         Xt = X
    468         for _, name, transform in self._iter(with_final=False):
--> 469             Xt = transform.transform(Xt)
    470         return self.steps[-1][1].predict(Xt, **predict_params)
    471 

/usr/local/lib/python3.9/site-packages/sklearn/compose/_column_transformer.py in transform(self, X)
    746             self._check_n_features(X, reset=False)
    747 
--> 748         Xs = self._fit_transform(
    749             X,
    750             None,

/usr/local/lib/python3.9/site-packages/sklearn/compose/_column_transformer.py in _fit_transform(self, X, y, func, fitted, column_as_strings)
    604         )
    605         try:
--> 606             return Parallel(n_jobs=self.n_jobs)(
    607                 delayed(func)(
    608                     transformer=clone(trans) if not fitted else trans,

/usr/local/lib/python3.9/site-packages/joblib/parallel.py in __call__(self, iterable)
   1041             # remaining jobs.
   1042             self._iterating = False
-> 1043             if self.dispatch_one_batch(iterator):
   1044                 self._iterating = self._original_iterator is not None
   1045 

/usr/local/lib/python3.9/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
    831                 big_batch_size = batch_size * n_jobs
    832 
--> 833                 islice = list(itertools.islice(iterator, big_batch_size))
    834                 if len(islice) == 0:
    835                     return False

/usr/local/lib/python3.9/site-packages/sklearn/compose/_column_transformer.py in <genexpr>(.0)
    607                 delayed(func)(
    608                     transformer=clone(trans) if not fitted else trans,
--> 609                     X=_safe_indexing(X, column, axis=1),
    610                     y=y,
    611                     weight=weight,

/usr/local/lib/python3.9/site-packages/sklearn/utils/__init__.py in _safe_indexing(X, indices, axis)
    374 
    375     if hasattr(X, "iloc"):
--> 376         return _pandas_indexing(X, indices, indices_dtype, axis=axis)
    377     elif hasattr(X, "shape"):
    378         return _array_indexing(X, indices, indices_dtype, axis=axis)

/usr/local/lib/python3.9/site-packages/sklearn/utils/__init__.py in _pandas_indexing(X, key, key_dtype, axis)
    220         # check whether we should index with loc or iloc
    221         indexer = X.iloc if key_dtype == "int" else X.loc
--> 222         return indexer[:, key] if axis else indexer[key]
    223 
    224 

/usr/local/lib/python3.9/site-packages/pandas/core/indexing.py in __getitem__(self, key)
    923                 with suppress(KeyError, IndexError):
    924                     return self.obj._get_value(*key, takeable=self._takeable)
--> 925             return self._getitem_tuple(key)
    926         else:
    927             # we by definition only have the 0th axis

/usr/local/lib/python3.9/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
   1107             return self._multi_take(tup)
   1108 
-> 1109         return self._getitem_tuple_same_dim(tup)
   1110 
   1111     def _get_label(self, label, axis: int):

/usr/local/lib/python3.9/site-packages/pandas/core/indexing.py in _getitem_tuple_same_dim(self, tup)
    804                 continue
    805 
--> 806             retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
    807             # We should never have retval.ndim < self.ndim, as that should
    808             #  be handled by the _getitem_lowerdim call above.

/usr/local/lib/python3.9/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1151                     raise ValueError("Cannot index with multidimensional key")
   1152 
-> 1153                 return self._getitem_iterable(key, axis=axis)
   1154 
   1155             # nested tuple slicing

/usr/local/lib/python3.9/site-packages/pandas/core/indexing.py in _getitem_iterable(self, key, axis)
   1091 
   1092         # A collection of keys
-> 1093         keyarr, indexer = self._get_listlike_indexer(key, axis)
   1094         return self.obj._reindex_with_indexers(
   1095             {axis: [keyarr, indexer]}, copy=True, allow_dups=True

/usr/local/lib/python3.9/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis)
   1312             keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
   1313 
-> 1314         self._validate_read_indexer(keyarr, indexer, axis)
   1315 
   1316         if needs_i8_conversion(ax.dtype) or isinstance(

/usr/local/lib/python3.9/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis)
   1375 
   1376             not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 1377             raise KeyError(f"{not_found} not in index")
   1378 
   1379 

KeyError: "['CNT_CHILDREN', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'OWN_CAR_AGE', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT', 'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR'] not in index"
In [64]:
display(expLog)
Experiment Number Model # Transformed Input Features # Original Numerical Features # Original Categorical Features Train Acc Valid Acc Test Acc Train F1 Valid F1 ... Valid AUROC Test AUROC Training Time Training Prediction Time Validation Prediction Time Test Prediction Time Hyperparameters Best Parameter Best Hypertuning Score Description
0 1 Baseline 1, LogReg with Original 17 Selected F... 49 10 7 0.920 0.916 0.919 0.014 0.015 ... 0.738 0.737 2.9348 0.713401 0.438845 0.189358 N/A N/A N/A Baseline 1 LogReg Model with Preselected Num a...
1 2 Baseline 2, LogReg with original 120 Features 250 104 16 0.920 0.916 0.919 0.022 0.020 ... 0.746 0.743 4.9481 1.188051 0.359550 0.321185 N/A N/A N/A Baseline 2 LogReg Model with Num and Cat Featu...
2 3 LogReg - L1 Penalty with Selected 17 Features 49 10 7 0.920 0.916 0.919 0.014 0.016 ... 0.738 0.737 15.1447 0.467919 0.210027 0.190774 N/A N/A N/A LogReg Model-L1 Penalty with Selected 17 Cat +...
3 4 LogReg - L1 Penalty with 120 Features 250 104 16 0.920 0.916 0.919 0.017 0.014 ... 0.745 0.743 60.4684 1.212351 0.351223 0.323352 N/A N/A N/A LogReg Model-L1 Penalty with 104 Num + 16 Cat ...
4 5 LogReg with Num and Cat Features + Debt_Income... 251 104 16 0.920 0.916 0.919 0.021 0.018 ... 0.746 0.743 5.0762 1.155572 0.350079 0.325508 N/A N/A N/A Logistic Regression Model with Original 120 N...
5 6 LogReg with Num and Cat Features + Debt_Income... 50 104 16 0.920 0.916 0.919 0.014 0.016 ... 0.738 0.737 2.1173 0.463066 0.200407 0.184219 N/A N/A N/A Logistic Regression Model with Original 17 Nu...
6 7 Random Forest with 17 Features 49 10 7 1.000 0.916 0.920 1.000 0.004 ... 0.720 0.721 4.9230 0.839898 0.325235 0.294644 N/A N/A N/A Random Forest Model with 10 Num + 7 Cat Features.
7 8 Gradboost with 17 Features 17 10 7 0.921 0.917 0.920 0.031 0.022 ... 0.746 0.746 1.8211 0.591207 0.261072 0.234816 N/A N/A N/A Gradboost Model with 10 Num + 7 Cat Features.
8 9 Gradboost with 120 Features 120 10 7 0.920 0.916 0.919 0.022 0.020 ... 0.746 0.743 5.1127 1.147315 0.364037 0.323979 N/A N/A N/A Gradboost Model with 104 Num + 16 Cat Features.

9 rows × 22 columns

In [63]:
# Function Build Barcharts of scores for all models
acc_df = expLog[['Model', 'Train Acc', 'Valid Acc', 'Test Acc']].copy()
F1_df = expLog[['Model','Train F1', 'Valid F1', 'Test F1']].copy()
AUROC_df = expLog[['Model','Train AUROC', 'Valid AUROC', 'Test AUROC']].copy()

def score_barchart(df, title):
    # Plot the bar chart
    df.set_index('Model', inplace=True)
    ax = df.plot(kind='bar', figsize=(10, 6))
    plt.title(f'{title} Score Comparison')
    plt.ylabel(title)
    plt.xticks(rotation=90)
    plt.show()

Kaggle Submission File Prep¶

In [68]:
test_class_scores = model.predict_proba(X_kaggle_test)[:, 1]
In [69]:
test_class_scores[0:10]
Out[69]:
array([0.06091203, 0.2319424 , 0.03704471, 0.03813419, 0.1346362 ,
       0.03247158, 0.02453353, 0.09834088, 0.01280437, 0.15194608])
In [70]:
# Submission dataframe
submit_df = datasets["application_test"][['SK_ID_CURR']]
submit_df['TARGET'] = test_class_scores

submit_df.head()
Out[70]:
SK_ID_CURR TARGET
0 100001 0.060912
1 100005 0.231942
2 100013 0.037045
3 100028 0.038134
4 100038 0.134636
In [71]:
submit_df.to_csv("submission.csv",index=False)

Write-up¶

Home Credit Default Risk (HCDR) Project - Phase 2¶

Team: GroupN_HCDR_1

Team Members:

  • Leona Naiki Chase
  • Nimish Misra
  • Jacob Shaw
  • Olga Vyrvich

Phase Leadership Plan

Phase Project Manager
1 Jacob
2 Leona
3 Olga
4 Nimish

Credit Assignment Plan

Overview:

Team and Plan Updates Presentation Slides Abstract Project Desctiption (Data and Tasks) EDA Visual EDA Modeling Pipelines Results and Discussion of Results Conclusion
Leona Leona Leona Nimish Jacob Jacob Olga Nimish Leona



Tasks:

Who What Time
Leona Create the phase2 notebook and update the Phase Leader Plan and Credit Assignment Plan 30 minutes
Nimish Describe the HCDR Dataset, identify the tasks to be tackled, and provide diagams to aid understanding of the workflow 1 hour
Jacob Run exploratory data analysis, including a data dictionary of the raw features, dataset size, summary statistics, correlation analysis, and other text-based analysis 1.5 hours
Jacob Run visual exploratory data analysis, including a visualization of each input and taret features, a visualization of the correlation analysis, and pair-wise visualization of the input and output features, a graphic summary of the missing analysis, etc. 1.5 hours
Olga Create a visualization of the modeling pipelines/subpipelines and identify familes of input features/count per family, and number of input features. 2 hours
Olga Record an experment log with details including the baseline experiment, families of input features used, Accuracy scores, and AUC/ROC scores. 1 hour
Leona After all other work is complete, create the abstract and conclusion to summarize at a high level the other work, and what the project will be 1 hour
Leona Create slides for the group's video presentation based on everyone's collective work 1 hour
Leona Review all work (including abstract and conclusion) and ensure professional appearance for the entirety of the phase2 notebook 1 hour
All Record a 2-minute video presentation about the project and our findings. The video will have a logical and scientific flow to it. 20 minutes

Abstract¶

In this project, our goal to create a machine learning (ML) model that can predict the likelihood of a borrower defaulting on their credit using unconventional data sources. The aim to provide lenders with a model that maximizes profits while minimizing the potential risk.In phase three of the project, we tackled the problem of predicting credit card default using machine learning techniques and focus on feature engineering and hyper tuning. Our main goal was to compare different models and identify the best score for this task.We conducted several experiments using different models and feature sets. The baseline experiments involved logistic regression models with 17 selected features and 120 original input features. These models achieved high accuracy and F1 scores. Next, we explored the use of L1 penalty in logistic regression models with selected features. We observed comparable results to the baseline models, suggesting that the L1 penalty did not improve performance.We also perfomed with random forest and gradient boosting models using selected features.Our observations suggest that gradient boosting with 17 selected features perfoms better then other models. This model achieved the highest accuracy, F1 score, and (AUROC) on both the validation and test sets.

In [ ]:
 

ML Pipelines¶

The pipelines used in this project were used to increase the efficiency and ease of understanding of our code. Our most basic pipelines (Level 3 Pipelines) were used to prepare our selected input feature data. The numerical features and categorical features were each handled in their own pipelines. Numerical data was standardized while categorical data was one-hot-encoded. In both sets, the pipelines handled imputing missing values. The Level 2 Pipeline, was a column transformer used to streamline the preparation of data before it is applied to our classifier, combining both numerical and categorical pipelines. Lasty, the Level 1 Pipeline was used to combine the Level 2 data preparation pipeline to the classifer model.

Our baseline pipelines:

  • Logistic Regression model with 17 selected input features from application_train.
  • Logistic Regression model with all 120 input features from application train (does not include SK_ID_CURR).
  • Logistic Regression model with both sets of input features using L1 penalty term.

We also ran experiments using Logisitic regression with a newly feature engineered, Debt to Income Ratio, using application_train features: 'AMT_CREDIT' and 'AMT_INCOME TOTAL'. Debt to Income Ratio is a good measure of ability to pay back loans, showing how much of a person's income will go out to pay debt. According to Wells Fargo.com (https://www.wellsfargo.com/goals-credit/smarter-credit/credit-101/debt-to-income-ratio/understanding-dti/#:~:text=35%25%20or%20less%3A%20Looking%20Good,a%20lower%20DTI%20as%20favorable.), A Debt-to-Income Ratio of 35% of less is considered good, 36-49% shows room for improvement, and over 50% show a need to take action.

We also ran different experiments using different algorithms such as a Random Forest model and Gradient Boosting Classifier.

Screenshot%202023-11-25%20at%204.22.31%E2%80%AFAM.png

Screenshot%202023-11-25%20at%204.22.51%E2%80%AFAM.png

Experiment Log¶

In [64]:
expLog.set_index('Experiment Number', inplace=True)
display(expLog)
Model # Transformed Input Features # Original Numerical Features # Original Categorical Features Train Acc Valid Acc Test Acc Train F1 Valid F1 Test F1 ... Valid AUROC Test AUROC Training Time Training Prediction Time Validation Prediction Time Test Prediction Time Hyperparameters Best Parameter Best Hypertuning Score Description
Experiment Number
1 Baseline 1, LogReg with Original 17 Selected F... 49 10 7 0.92 0.916 0.919 0.014 0.015 0.011 ... 0.738 0.737 5.7270 0.616539 0.235517 0.228772 N/A N/A N/A Baseline 1 LogReg Model with Preselected Num a...
2 Baseline 2, LogReg with original 120 Features 250 104 16 0.92 0.916 0.919 0.019 0.018 0.021 ... 0.746 0.743 10.8317 4.910362 0.653457 0.585086 N/A N/A N/A Baseline 2 LogReg Model with Num and Cat Featu...
3 LogReg - L1 Penalty with Selected 17 Features 49 10 7 0.92 0.916 0.919 0.014 0.016 0.011 ... 0.738 0.737 21.0944 0.607865 0.234333 0.215436 N/A N/A N/A LogReg Model-L1 Penalty with Selected 17 Cat +...
4 LogReg - L1 Penalty with 120 Features 250 104 16 0.92 0.916 0.919 0.017 0.014 0.018 ... 0.745 0.743 82.6300 1.839408 0.539958 0.517283 N/A N/A N/A LogReg Model-L1 Penalty with 104 Num + 16 Cat ...
5 LogReg with Num and Cat Features + Debt_Income... 251 104 16 0.92 0.916 0.919 0.021 0.018 0.023 ... 0.746 0.743 10.3708 2.619332 0.641288 0.518304 N/A N/A N/A Logistic Regression Model with Original 120 N...
6 LogReg with Num and Cat Features + Debt_Income... 50 104 16 0.92 0.916 0.919 0.014 0.016 0.012 ... 0.738 0.737 3.9352 0.593195 0.238655 0.220598 N/A N/A N/A Logistic Regression Model with Original 17 Nu...
7 Random Forest with 17 Features 49 10 7 1.00 0.916 0.920 1.000 0.005 0.007 ... 0.721 0.724 14.7829 3.057214 0.778230 0.787176 N/A N/A N/A Random Forest Model with 10 Num + 7 Cat Features.
8 Gradboost with 17 Features 17 10 7 0.92 0.916 0.920 0.023 0.014 0.022 ... 0.747 0.748 4.0657 1.022855 0.341756 0.315761 N/A N/A N/A Gradboost Model with 10 Num + 7 Cat Features.
9 Gradboost with 120 Features 120 10 7 0.92 0.916 0.919 0.019 0.018 0.021 ... 0.746 0.743 9.2200 2.100374 0.598318 0.507419 N/A N/A N/A Gradboost Model with 104 Num + 16 Cat Features.

9 rows × 21 columns

Loss Functions¶

Logistic function

$$ \sigma(t) = \dfrac{1}{1 + \exp(-t)} $$

Logistic Regression model prediction

$$ \hat{y} = \begin{cases} 0 & \text{if } \hat{p} < 0.5, \\ 1 & \text{if } \hat{p} \geq 0.5. \end{cases} $$

Cost function of a single training instanc

$$ c(\boldsymbol{\theta}) = \begin{cases} -\log(\hat{p}) & \text{if } y = 1, \\ -\log(1 - \hat{p}) & \text{if } y = 0. \end{cases} $$

Binary Cross-Entropy Loss (CXE)

Binary Cross Entropy loss, aka log loss, is a special case of negative log likelihood. It measures a classifier's performance, increases as the predicted probability moves farther from the true label. The goal in logistic regression is to minimize the CXE. $$ J(\boldsymbol{\theta}) = -\dfrac{1}{m} \sum\limits_{i=1}^{m}{\left[ y^{(i)} log\left(\hat{p}^{(i)}\right) + (1 - y^{(i)}) log\left(1 - \hat{p}^{(i)}\right)\right]} $$

LASSO Binary Cross Entropy (LBXE) $$ J(\boldsymbol{\theta}) = -\dfrac{1}{m} \sum\limits_{i=1}^{m}{\left[ y^{(i)} log\left(\hat{p}^{(i)}\right) + (1 - y^{(i)}) log\left(1 - \hat{p}^{(i)}\right)\right]} + \lambda \sum_{j=1}^{n}|w_j| $$

Ridge Binary Cross Entropy (RBXE) $$ J(\boldsymbol{\theta}) = -\dfrac{1}{m} \sum\limits_{i=1}^{m}{\left[ y^{(i)} log\left(\hat{p}^{(i)}\right) + (1 - y^{(i)}) log\left(1 - \hat{p}^{(i)}\right)\right]} + \lambda \sum_{j=1}^{n}w_j^2 $$

Primal Soft Margin SVM Classifier $$ {\displaystyle \underset{W,b, \zeta}{\text{argmin }}{\overbrace{\dfrac{1}{2}}^A \underbrace{\mathbf{w}^T \cdot \mathbf{w}}_B \quad + }C\sum _{i=1}^{m}\zeta _{i}} $$

Evaluation Metrics¶

Accuracy Score¶

$$ \text{Accuracy} = \cfrac{TP + TN}{TP + TN + FP + FN} $$
In [65]:
score_barchart(acc_df, "Accuracy")

F1 Score¶

$$ \text{precision} = \cfrac{TP}{TP + FP} $$$$\text{recall} = \cfrac{TP}{TP + FN}$$


$$ F_1 = \cfrac{2}{\cfrac{1}{\text{precision}} + \cfrac{1}{\text{recall}}} = 2 \times \cfrac{\text{precision}\, \times \, \text{recall}}{\text{precision}\, + \, \text{recall}} = \cfrac{TP}{TP + \cfrac{FN + FP}{2}} $$

In [66]:
score_barchart(F1_df, "F1")

Area Under the Receiver Operating Characteristics (AUROC)¶

$$\text{TPR (aka recall or specificity)} = \cfrac{TP}{TP + FN}$$


$$ \text{Specificity} = \cfrac{TN}{TN + FP} $$


$$ \text{FPR = 1 - Specificity} = \cfrac{FP}{TN + FP} $$

In [67]:
score_barchart(AUROC_df, "AUROC")

Conclusion¶

In Phase 3 of the HCDR Project, we used Home Credit's extensive data set to build baseline models to accurately predict whether a client with minimal credit history would pay back a loan. This real-world problem is highly relevant today in a society of rapidly growing wealth disparities. Using machine learning pipelines to preprocess and transform input features, we built several models that were evaluated by the performance metrics: accuracy score, F1 score, and AUROC.

Kaggle Submission File Prep¶

In [ ]:
 

References:

  • "List Highest Correlation Pairs from a Large Correlation Matrix in Pandas?". Asked July 22, 2013 by Kyle Brandt. Stack Overflow, "https://stackoverflow.com/questions/17778394/list-highest-correlation-pairs-from-a-large-correlation-matrix-in-pandas".
    Answered January 3, 2017 by Arun. Licensed under Creative Commons.
  • "Sort Correlation Matrix in Python. Accessed 11/23/23. geeksforgeeks.com, ("https://www.geeksforgeeks.org/sort-correlation-matrix-in-python/)
  • " What is a Good Debt-to-Income Ratio?". Accessed 11/20/2023. Wellsfargo.com, (https://www.wellsfargo.com/goals-credit/smarter-credit/credit-101/debt-to-income-ratio/understanding-dti/#:~:text=35%25%20or%20less%3A%20Looking%20Good,a%20lower%20DTI%20as%20favorable.)